Compare commits

265 Commits

Author SHA1 Message Date
Patrick Devine
61349a8ec6 tests: move csv output to benstat format 2025-10-26 18:24:35 -07:00
Patrick Devine
b97eb2b858 cloud: set the proxy content-type to the same as local models (#12759) 2025-10-25 10:57:10 -07:00
Jesse Gross
ad6f6a1d29 llm: Change memory allocation backoff from exponential to incremental
If we create a memory layout that should fit based on reported free VRAM
but allocation still fails, we start applying a backoff. This reduces
free VRAM by an exponential percentage (1%, 2%, 4%...). However, the
points chosen tend to be too dense at the beginning and too sparse at
the end. Therefore, this switches to an incremental backoff (10%, 20%,
30%...).
2025-10-23 12:58:31 -07:00
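
The two backoff schedules described in the commit above can be compared with a small sketch. This is an illustration only: the function names `exponentialPercent`, `incrementalPercent`, and `usableVRAM` are hypothetical and are not taken from the Ollama source, and the real allocation code may apply the backoff differently.

```go
package main

import "fmt"

// exponentialPercent models the old schedule: the percentage of reported
// free VRAM withheld on retry attempt 0, 1, 2, ... is 1, 2, 4, ...
// (hypothetical helper, not the project's actual function)
func exponentialPercent(attempt int) uint64 {
	return uint64(1) << uint(attempt)
}

// incrementalPercent models the new schedule: 10, 20, 30, ...
// (hypothetical helper, not the project's actual function)
func incrementalPercent(attempt int) uint64 {
	return 10 * uint64(attempt+1)
}

// usableVRAM returns the free VRAM left after withholding percent of it.
func usableVRAM(free, percent uint64) uint64 {
	if percent >= 100 {
		return 0
	}
	return free - free*percent/100
}

func main() {
	const free = 8 << 30 // assume 8 GiB reported free, purely for illustration
	for attempt := 0; attempt < 5; attempt++ {
		e, i := exponentialPercent(attempt), incrementalPercent(attempt)
		fmt.Printf("attempt %d: exponential %2d%% (%d bytes), incremental %2d%% (%d bytes)\n",
			attempt, e, usableVRAM(free, e), i, usableVRAM(free, i))
	}
}
```

In this model the exponential schedule has withheld only 16% after five attempts, while the incremental one has reached 50%, which matches the commit's point that the exponential steps are too dense at the start.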
Vinh Nguyen
6723a40be6 readme: add VT Code project to terminal community integrations (#12749) 2025-10-23 12:29:50 -07:00
Daniel Hiltgen
3258a89b6e DRY out the runner lifecycle code (#12540)
* DRY out the runner lifecycle code

Now that discovery uses the runners as well, this unifies the runner spawning code
into a single place.  This also unifies GPU discovery types with the newer ml.DeviceInfo

* win: make incremental builds better

Place build artifacts in discrete directories so incremental builds don't have to start fresh

* Adjust sort order to consider iGPUs

* handle cpu inference oom scenarios

* review comments
2025-10-23 11:20:02 -07:00
Jesse Gross
1c093e97af kvcache: Remove special case for reservation mask
We currently short circuit generation of the cache mask and just
generate an empty tensor of the correct size. However, in some
cases, this can also skip a cast operation. This can result in the
worst case graph being not fully worst case.

We don't actually need the fast path for mask generation, so it's
better to just use the normal code path.
2025-10-22 17:38:04 -07:00
Jesse Gross
a8d9c2648e llamarunner: Record the time for all batches during prompt processing
Currently, we only record the time for the last batch when processing
the prompt. This results in unrealistically high numbers for the
old llama runner.

Before:
total duration:       31.273112939s
load duration:        4.97054657s
prompt eval count:    32768 token(s)
prompt eval duration: 235.137439ms
prompt eval rate:     139356.80 tokens/s
eval count:           1873 token(s)
eval duration:        18.173182374s
eval rate:            103.06 tokens/s

After:
total duration:       30.024798033s
load duration:        4.758588663s
prompt eval count:    32768 token(s)
prompt eval duration: 7.779621548s
prompt eval rate:     4212.03 tokens/s
eval count:           1769 token(s)
eval duration:        17.148014223s
eval rate:            103.16 tokens/s
2025-10-22 13:52:58 -07:00
frob
0334e67ffd tools: parse tool calls that don't conform to {"name": name, "arguments": args} (#12738) 2025-10-22 11:34:27 -07:00
nicole pardal
e0ead1adee embeddings: base64 encoding fix (#12715) 2025-10-22 11:27:44 -07:00
Patrick Devine
d515aed6c3 cloud: don't error sending empty messages (#12724) 2025-10-21 18:12:14 -07:00
Jeffrey Morgan
5fe7ba1b9b runner: always truncate embeddings requests (#12714) 2025-10-20 16:47:05 -07:00
Michael Yang
d2b63c19b3 fs(ggml): fill in arch prefix if necessary (#12646) 2025-10-20 16:42:18 -07:00
Jeffrey Morgan
94f110b35a model/parsers: remove warning for missing <think> tag for qwen3-vl (#12713) 2025-10-20 16:03:43 -07:00
Daniel Hiltgen
5d22953ba7 cuda: get driver version after props (#12707)
Users on Windows without GPUs are reporting errors relating to
cudaDriverGetVersion with the device set to -1.  This ensures we only grab the
driver once we're enumerating actual devices.
2025-10-20 10:57:27 -07:00
Daniel Hiltgen
d245dffed8 rocm: give it more time to bootstrap (#12681)
Some users are hitting timeouts.  We'd like to make this faster, but for now make sure we don't time out too aggressively.
2025-10-20 09:43:05 -07:00
Daniel Hiltgen
bc1a818fdc contiguous input per layer (#12686)
Co-authored-by: Michael Yang <git@mxy.ng>
2025-10-17 18:39:18 -07:00
Daniel Hiltgen
ba2253dc30 win: more verbose load failures (#12683)
When loading the dynamic libraries, if something goes wrong report some
details.  Unfortunately this won't explain which dependencies are missing,
but this breadcrumb in the logs should help us diagnose GPU discovery
failures.
2025-10-17 17:13:16 -07:00
Daniel Hiltgen
68e04c7ff8 test: harden scheduler tests (#12662)
* test: harden scheduler tests

This removes reschedDelay which was stale code, and adds
a new configurable timeout for the waitForVRAMRecovery so
tests can now set the timeout to be very short to avoid the
scheduler getting stuck and hitting a test timeout.

* test: tune tests for partial loads

Give stress tests more time when the model is split between CPU/GPU
2025-10-17 08:56:44 -07:00
Daniel Hiltgen
270679932f cuda: tidy up CC settings (#12668)
8.7 is Jetpack only, so no need on x86 builds
10.3 covers [G]B300
2025-10-16 16:39:30 -07:00
Jeffrey Morgan
65fb3ff49d renderers: add global flag for setting [img] tags (#12669)
Adds a temporary global flag to renderers that causes renderers to always
render images as [img]. In a follow up change, we will consider making this
the default, and this flag could eventually be removed
2025-10-16 16:37:32 -07:00
Grace
e2a0b24435 Grace/qwen3 thinking (#12647)
* changing initial status to take into consideration prefill

* Add separate strings for content and thinking builder

* thinking tests

* remove white space from string before closing think tag
2025-10-16 15:29:41 -07:00
Daniel Hiltgen
1813ff85a0 cuda: bring back CC 5.2 (#12666)
Forward compat on the newer driver doesn't seem to be working.
This should get 5.2 working on newer drivers again.
2025-10-16 13:07:41 -07:00
Daniel Hiltgen
b531777a66 test: add a few missing embedding models (#12661) 2025-10-16 09:36:25 -07:00
Daniel Hiltgen
fe3ec8dbf0 Revert "Workaround broken NVIDIA iGPU free VRAM data (#12490)" (#12642)
The workaround has been moved into the underlying C++ code.

This reverts commit e4340667e3.
2025-10-16 09:09:48 -07:00
Thomas Stocker
c744134287 vulkan: Get FilterID from Backend for Vulkan (#12655)
* vulkan: Get FilterID from Backend for Vulkan

* Fixing patch
2025-10-16 09:07:35 -07:00
weedge
4be41d2d45 readme: add achatbot-go to community integrations (#12629) 2025-10-15 21:54:15 -07:00
zhetaicheleba
de670570c9 fs/ggml: fix function name in comment (#12630) 2025-10-15 21:53:38 -07:00
Devon Rifkin
201d93716e Merge pull request #12651 from ollama/drifkin/oai-conversion
openai: make tool call conversion fns public
2025-10-15 21:10:30 -07:00
Devon Rifkin
160cecc8e2 openai: make tool call conversion fns public 2025-10-15 20:54:58 -07:00
Daniel Hiltgen
8b6e5baee7 CI: Set up temporary opt-out Vulkan support (#12614)
Initially Vulkan support in Ollama will require building from source.  Once it is
more thoroughly tested and we have fixed any critical bugs, then we can
bundle Vulkan into the official binary releases.
2025-10-15 14:18:01 -07:00
Daniel Hiltgen
75d17fc6c2 perf: backport cuda iGPU sched spin (#12641) 2025-10-15 11:52:14 -07:00
Santosh Bhavani
8fafc8af77 ml/backend/ggml: NVML fallback for unified memory GPUs (#12619)
* Simplify NVML fallback for unified memory GPUs

Remove device-specific checks and environment variable dependency for
NVML_ERROR_NOT_SUPPORTED fallback. When NVML doesn't support memory
queries, unconditionally use /proc/meminfo instead of checking device
names or OLLAMA_UNIFIED_MEMORY environment variable.

This provides better memory reporting by using MemAvailable which
accounts for reclaimable memory, avoiding the underreporting issue
described in NVIDIA support article a_id/5728.

Tested on NVIDIA GB10 unified memory iGPU with consistent and accurate
memory reporting across multiple model load/unload cycles.

* Add NVML fallback patch for unified memory GPUs
2025-10-15 11:40:06 -07:00
Jesse Gross
c3c85aa06c llm: Enable flash attention by default for gemma3 2025-10-15 10:42:12 -07:00
Jeffrey Morgan
0d713051a2 envconfig: default to port 443 when connecting to ollama.com (#12617) 2025-10-14 23:38:24 -07:00
Parth Sareen
c4c5a4a01e types: send index for tool calls (#12625) 2025-10-14 19:35:15 -07:00
Jesse Gross
3dcfd5f69e llm: Perform eviction when num_gpu is set with new estimates
Currently, if you set num_gpu then this forces the model to
load with that number of layers in the current configuration.
This is done regardless of any other information, which means
that no eviction is performed even if another model is loaded.

This behavior is different from the old estimates (and still
happens for models that run on the llama engine). In those
cases, models would be evicted if needed to load at the requested
number of layers. That behavior is more useful and less surprising,
so this changes the new estimates to match.

Fixes #12580
2025-10-14 17:46:36 -07:00
Devon Rifkin
53a969d509 Merge pull request #12621 from ollama/drifkin/any-of
qwen3-coder: support anyOf when parsing tool calls
2025-10-14 15:51:24 -07:00
Devon Rifkin
08fbb60bb2 qwen3-coder: support anyOf when parsing tool calls 2025-10-14 15:33:05 -07:00
Daniel Hiltgen
850da848c5 logs: fix bogus "0 MiB free" log line (#12590)
On the llama runner, after the recent GGML bump a new log line reports
incorrect 0 MiB free after our patch to remove memory from the props.  This
adjusts the llama.cpp code to fetch the actual free memory of the active device.
2025-10-14 11:26:28 -07:00
Thomas Stocker
2aba569a2a Vulkan based on #9650 (#11835)
* implement the vulkan C backend

* add support in gpu.go

* add support in gen_linux.sh

* it builds

* fix segfault

* fix compilation

* fix free memory monitor

* fix total memory monitor

* update gpu.go

* fix build

* fix check_perfmon len

* remove cap_get_bound check

* fix vulkan handle releasing

* fix build on Fedora 40

* fix vulkan on windows

* making amdgpu work on arm architecture with vulkan

* add x86_64 lines in VulkanGlobs and capLinuxGlobs

* add aarch64 lines in vulkanGlobs and capLinuxGlobs

* Fix variable name

* Add vulkan build patch from @jmorganca

* Sync vendored ggml to add Vulkan support

* Updated dockerfile

https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Installing rocm library

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* This version works well

built based on this: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Applied 00-fix-vulkan-building.patch

Work done by McBane87 here: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Fixed the "detached head" issues

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Merged in the right direction

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Merging the latest stable (#2)

* Applied 00-fix-vulkan-building.patch

* Implemented vulkan backend based on the work done by whyvl, Dts0, McBane87 and others

Tested on AMD Ryzen 7 8845HS w/ Radeon 780M Graphics with ROCm disabled

```
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-03-11T13:00:40.793Z level=INFO source=gpu.go:199 msg="vulkan: load libvulkan and libcap ok"
time=2025-03-11T13:00:40.877Z level=INFO source=gpu.go:421 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:443 msg="amdgpu detected, but no compatible rocm library found.  Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:348 msg="unable to verify rocm library: no suitable rocm found, falling back to CPU"
time=2025-03-11T13:00:40.879Z level=INFO source=types.go:137 msg="inference compute" id=0 library=vulkan variant="" compute=1.3 driver=1.3 name="AMD Radeon Graphics (RADV GFX1103_R1)" total="15.6 GiB" available="15.6 GiB"
```

```
 # ollama run phi4:14b
>>> /set verbose
Set 'verbose' mode.
>>> how's it going?
Hello! I'm here to help you with any questions or tasks you have. How can I assist you today? 😊

total duration:       3.341959745s
load duration:        18.165612ms
prompt eval count:    15 token(s)
prompt eval duration: 475ms
prompt eval rate:     31.58 tokens/s
eval count:           26 token(s)
eval duration:        2.846s
eval rate:            9.14 tokens/s
>>>
```

* This is no longer needed

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Fixes SIGSEGV: segmentation violation running gemma3 models on ollama 0.6.0 #21

Patch provided by McBane87 on https://github.com/whyvl/ollama-vulkan/issues/21

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Applied 04-disable-mmap-vulkan.patch

From: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Pulled new upstream code for ggml-vulkan backend

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Merged latest ollama 0.6.2 and nasrally's Flash Attention patches (#5)

* readme: add Ellama to list of community integrations (#9800)

* readme: add screenpipe to community integrations (#9786)

* Add support for ROCm gfx1151 (#9773)

* conditionally enable parallel pipelines

* sample: make mutations in transforms explicit (#9743)

* updated minP to use early exit making use of sorted tokens

* ml/backend/ggml: allocate memory with malloc when loading model (#9822)

* runner: remove cache prompt flag from ollama runner (#9826)

We do not need to bypass the prompt caching in the ollama runner yet, as
only embedding models needed to bypass the prompt caching. When embedding
models are implemented they can skip initializing this cache completely.

* ollamarunner: Check for minBatch of context space when shifting

Models can specify that a group of inputs needs to be handled in a single
batch. However, context shifting didn't respect this and could trigger
a break anyways. In this case, we should instead trigger a context
shift earlier so that it occurs before the grouped batch.

Note that there are still some corner cases:
 - A long prompt that exceeds the context window can get truncated
   in the middle of an image. With the current models, this will
   result in the model not recognizing the image at all, which is
   pretty much the expected result with truncation.
 - The context window is set less than the minimum batch size. The
   only solution to this is to refuse to load the model with these
   settings. However, this can never occur with current models and
   default settings.

Since users are unlikely to run into these scenarios, fixing them is
left as a follow up.

* Applied latest patches from McBane87

See this for details: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2708820861

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Add ability to enable flash attention on vulkan (#4)

* discover: add flash attention handling for vulkan
* envconfig: fix typo in config.go

As part of the process some code was refactored and I added a new field
FlashAttention to GpuInfo since the previous solution didn't allow for a
granular check via vulkan extensions. As a side effect, this now allows
for granular per-device FA support checking in other places

---------

Signed-off-by: Vadim Grinco <vadim@grinco.eu>
Co-authored-by: zeo <108888572+zeozeozeo@users.noreply.github.com>
Co-authored-by: Louis Beaumont <louis.beaumont@gmail.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Nikita <50599445+nasrally@users.noreply.github.com>

* Revert Readme changes

* Revert

* Revert changes in amd_linux.go

* Revert changes in amd_linux.go

* Remove flashattention setting gpu.go

* Revert whitespace changes in gpu.go

* Revert changes in transforms_test.go

* Revert changes in runner.go

* Revert changes in Makefile.sync

* Revert some unintended changes in Dockerfile

* Revert vulkan copy changes in Dockerfile

* Update Vulkan Code to de4c07f93783a1a96456a44dc16b9db538ee1618

* Fixed duplicate sync in ggml.go

* Revert changes in ggml.go

* Revert changes in ggml.go

* enable flash attention on vulkan

* revert remove parenthesis

* fixed flash attention logic enabling

* vk_check_flash_attention 0 means supported

* Update gpu.go

* Add vulkan to Windows Build script

* Remove commented out code

* Enable Vulkan Flash attention in FlashAttentionSupported

* Fix logging

* Update Vulkan backend to e54d41befcc1575f4c898c5ff4ef43970cead75f

* Removed libcap related code

libcap is not directly related to Vulkan and should be added by its own PR. It adds additional library dependencies for building and also requires users to run setcap or run ollama as root, which is not ideal for easy use

* Fix Unit Test (Add Vulkan Library)

* Add vulkan to TestHomogeneousGPUs Test

* vulkan: get GPU ID (ollama v0.11.5)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* disable mmap for vulkan

* Reduce Changes remove TestHomogeneousGPUs (doesn't exist on master)

* Update vulkan version to the version used in llama.cpp

* rename gpu patch to correct number

* added Vulkan API to get correct Device UUID

current UUID from pipelineCacheUUID does not match CUDA

* Fix GPU ID Patch

* Remove Code not in llama.cpp

* modified UUID code inside ggml

* Fix Patch

* Copied minimal definition from vulkan header

* Fix compile error in Mac

Metal is preferred so we're disabling Vulkan for now

* Removed unused code

Fix linter error in CI

* Fix patches apply

* fixing lint error

* Removed unneeded function call

Somehow removing this call fixed the crashing when Vulkan header was removed

* added missing NL

* Fixed missing members in Vulkan header

also added zero clear for some structs

* Fixed wrong structure ID

* Fixed Vulkan header

More aligned with official header definition now

* buildvulkanAsSeperateFunction

* Vulkan on Windows Test

* temporarily comment out gate to run windows task

* temporarily use windows-latest for build

* Commenting out other presets to build vulkan

* reenable cpu

* commenting out error action stop

* temporarily commenting out rocm

* set vulkan path

* comment out cuda for faster turnaround

* correct vulkan install

* correct vulkan silent install

* fixed install command

* revert debugging changes (vulkan builds on windows)

* revert windows-latest

* trying to build vulkan for linux

* temporarily disable cuda and rocm

* try again linux build

* fix version

* trying to fix

* trying again

* trying again

* fix version

* fixed vulkan-sdk name

* try again

* trying again

* try without version number

* try again

* add some more extra

* trying to use version 1.4.313

* revert debugging changes

* Filter out already supported gpus

* revert debug code

* Use runners for GPU discovery

This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runner's capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.

* timing info for runner

* WIP - wire up Vulkan with the new engine based discovery

Not a complete implementation - free VRAM is better, but not accurate on
windows

* fix - trust the library paths from discovery when starting runner

* fix index bug

* fix vulkan ids to be underlying

* fix - give bootstrapping more time on slow systems

* Test if Vulkan device is supported

* vk_check_flash_attention is not needed (coopmat2, coopmat and scalar implementations exist)

* Handle GGML_VK_VISIBLE_DEVICES

* ask for supported first

* win: fix CPU query buffer handling

Try in a short loop until we get the size right.

* test: harden integration tests for slow start

If the server takes a while to start up, block
tests from starting until it's online to avoid
setting large timeouts in individual test cases.

* gofumpt fix

* fix build

* merge fixes

* merge fixes

* fixed build

* merge fixes

* fixing build

* fixed build

* fixed formatting

* fixed build

* fix vulkan gpu id patch

* sync llama.cpp vulkan code

* update build windows script

* merge fixes

* fix format

* fixed vulkan casing

* handle igpu as gpu

* improve case

* print out unknown library

* rturn Vulkan for vulkan library

* Revert "rturn Vulkan for vulkan library"

This reverts commit 690461a12f.

* fixed patch number

* return Library Name

* remove debug code

* return integrated in vulkan backend

* Return pci Properties

* update patch

* directly get pci proeprties without parsing

* workaround for filtering devices. Correct way is to have a LibraryPosition Parameter in the deviceInfo

* Revert "directly get pci proeprties without parsing"

This reverts commit 8e0624851f.

* Set FilteredID for Environment Filtering

* ROCm Library is named ROCm

* revert changes in patch

* Create 0028-vulkan-pci-and-memory.patch

* vulkan memory patch

* casing fix

* Add more pci properties

* Added better memory management

* Added better memory management

* fixed patch

* Fixed patch

* FilterID creation group by library

* filter out vulkan supported by other gpu

* fixing deviceid compare

* Vulkan Fix FA coopmat1 invalid array indexing

* Use everywhere the same Vulkan Version 1.4.321.1

* Remove unneeded patch

* vulkan update

* sync vulkan glsl files

* only use for vulkan the filteredid (numeric device number)

* simplify code

---------

Signed-off-by: Vadim Grinco <vadim@grinco.eu>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: pufferffish <github@bandersnatch.anonaddy.com>
Co-authored-by: KOISHI KOMEIJI FROM TOUHOU 11 <fuck>
Co-authored-by: DSLstandard <qgeneral35@gmail.com>
Co-authored-by: pufferffish <me@windtfw.com>
Co-authored-by: yeongbba <yeongmo.lee@logpresso.com>
Co-authored-by: tomaThomas <tomathomas@mailbox.org>
Co-authored-by: Antoine Viallon <antoine@lesviallon.fr>
Co-authored-by: Vadim Grinco <vadim@grinco.eu>
Co-authored-by: zeo <108888572+zeozeozeo@users.noreply.github.com>
Co-authored-by: Louis Beaumont <louis.beaumont@gmail.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Nikita <50599445+nasrally@users.noreply.github.com>
Co-authored-by: Masato Nakasaka <masato.nakasaka@intel.com>
Co-authored-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-10-14 10:59:58 -07:00
Devon Rifkin
fd8aa947f3 Merge pull request #12562 from ollama/drifkin/registries
add registries for parsers/renderers
2025-10-14 02:01:53 -07:00
Devon Rifkin
ddaca643d0 add registries for parsers/renderers 2025-10-14 01:13:54 -07:00
Grace
05982a95cb Qwen3VL Cloud Parser and Renderer (#12526)
* working (other than tool calls being in the incorrect order) for tool calls and tools

* Tests work, other than image tags (tests do not go through server) and tools (not in the correct order, but contents are the same)

* testing for qwen3vl parser - toolparser is working

* made changes to JSON tool parser, wraps the ToolCallFunction with a ToolCall object

* Working parser for thinking models - assumes state of thinking, emits unambiguous content in thinking, does not call tool call in thinking

* changed the parser to start with collecting content

* thinking prefill

* add hasThinkingSupport parameter to parser

* qwen3-vl -> qwen3-vl-instruct for renderer/parser

* Add hasThinkingSupport=false to QwenVLParser

---------

Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-10-13 16:52:33 -07:00
Gabe Goodhart
4987f13d34 Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552)
* feat: Bump llama.cpp to df1b612

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(mtmd): Correctly encode text chunks during mtmd tokenization

There can be text chunks that appear interspersed with the image embeddings
that contain template delimiter tokens for some models. These need to be
correctly translated to text tokens.

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* tests: Use MtmdChunk in image_test

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix unnecessary conversion linting

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(ggml): Revert changes to ggml_hip.cpp

These changes were done largely by our code assistant and are likely wrong

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Revert changes in mem_nvml.cpp

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update sync point to 1deee0

This brings in several more optimization commits and model support for
EmbeddingGemma

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches for 1deee0

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: sync for bump to 1deee0

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Bad patch updates with errant `+`

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp/ggml to 7049736

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: format-patches after latest bump

Branch: LlamaCPPBump-GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-10-13 15:26:18 -07:00
Jeffrey Morgan
e638f2acb6 runner: fix shifting on llama runner (#12604) 2025-10-13 13:46:33 -07:00
Michael Yang
18087f2ec7 Revert "use llama runner for qwen3 (#12556)"
This reverts commit 3d32249c74.
2025-10-13 13:30:30 -07:00
Michael Yang
6c833d5f8d fix(qwen3): deepseek distill
deepseek's qwen3 distill uses a different rope scheme so support both
2025-10-13 13:30:30 -07:00
Jeffrey Morgan
6544e14735 Reapply "add truncate and shift parameters" (#12582) 2025-10-11 16:06:14 -07:00
Devon Rifkin
5db8a818a1 Merge pull request #12581 from ollama/drifkin/renderer-api-generate
routes: fix built-in renderers for `api/generate`
2025-10-11 14:10:23 -07:00
Devon Rifkin
6db8da9958 routes: fix built-in renderers for api/generate
Made it so when api/generate builds up a message array and generates the
prompt it now goes through the same function as `api/chat` for
consistency. This is where we hook the optional built-in renderers to
bypass templates, which was missing for `api/generate` before this
change.

Closes: #12578
2025-10-11 13:57:43 -07:00
frob
0c68ec8d6a discover: fix typo (#12565) 2025-10-11 12:06:02 -07:00
Daniel Hiltgen
70d9e363e1 doc: remove AMD EOL GPUs (#12567) 2025-10-10 17:16:29 -07:00
Michael Yang
1a2feb2a97 ollamarunner: fix deadlock
hardErrCh will deadlock since forwardBatch is blocked on
computeStartedCh which never gets sent. since the response to
hardErrCh is to panic, just panic instead
2025-10-10 16:49:57 -07:00
Daniel Hiltgen
aab2190420 implement nvml for linux (#12517)
* implement nvml for linux

* Improve scheduler logging when VRAM doesn't recover
2025-10-10 15:15:56 -07:00
Michael Yang
629db9dc43 comment split 2025-10-10 13:25:34 -07:00
Michael Yang
e0cd511661 fix test 2025-10-10 13:25:34 -07:00
Michael Yang
207332078f fix lint 2025-10-10 13:25:34 -07:00
Michael Yang
93085127f4 convert: slice gate_up weight 2025-10-10 13:25:34 -07:00
Michael Yang
c00fa9cc2b convert: split gate_up bias 2025-10-10 13:25:34 -07:00
yajianggroup
df411c4b02 refactor: using testing.B.Loop
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
2025-10-10 13:25:29 -07:00
Jeffrey Morgan
3d32249c74 use llama runner for qwen3 (#12556) 2025-10-09 19:08:21 -07:00
Patrick Devine
d681cd7c29 thinking: allow "think": false for non-thinking models (#12555) 2025-10-09 18:46:00 -07:00
shengxinjing
47298fce39 refactor: use builtin max and min 2025-10-09 16:17:52 -07:00
shengxinjing
4a48937ef1 refactor: use builtin max and min 2025-10-09 16:17:52 -07:00
Michael Yang
967a82f52f ollamarunner: measure only active time 2025-10-09 15:44:04 -07:00
Michael Yang
bbbc73d637 llamarunner: update metrics
this change updates how metrics are collected. until now, performance
metrics, specifically initial input processing and subsequent generation
durations, were collected by taking the timestamp when creating a new
sequence, the first token generation, and completing generation. the
processing duration is taken as first token generation minus sequence
creation while generation is taken as completing generation minus first
token generation.

while this approach is an accurate end-to-end metric of processing and
generation, it's not comparable to other tools which only measure the
active, i.e. decode, duration.

this change updates the metrics to only capture decode duration so it
can be more directly compared to other tools
2025-10-09 15:44:04 -07:00
Daniel Hiltgen
15e3611d3d logs: quiet down context canceled on completion and scheduler noise (#12553)
* logs: quiet down context canceled on completion

If the client closes the connection before Completion finishes, we were
logging at error level implying the runner crashed which was misleading.

time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"

* quiet down scheduler log error on expected case

Since we don't hold the lock while performing memory load calculations, other
runners can unload in parallel, so finding no runner to unload is a valid scenario
which we shouldn't log at error level.
2025-10-09 10:37:47 -07:00
Parth Sareen
77060d462c routes: structured outputs for gpt-oss (#12460) 2025-10-08 19:13:38 -07:00
Patrick Devine
1b91d4dda1 openai: change the reasoning_effort field to also take none 2025-10-08 18:21:01 -07:00
Jeffrey Morgan
7d965258ce Revert "add truncate and shift parameters (#12519)" (#12545)
This reverts commit 6a62b894c7.
2025-10-08 17:57:57 -07:00
Jeffrey Morgan
6a62b894c7 add truncate and shift parameters (#12519) 2025-10-08 17:05:05 -07:00
Patrick Devine
90d429f5a8 thinking: turn on thinking mode for all reasoning models (#12533) 2025-10-08 16:50:13 -07:00
Jesse Gross
1fc35f1260 kvcache: Clean up sliding window state with independent batches
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that
are out of the cache's window each time we start a new forward pass.

The cache storage needs to handle the window size for each sequence
plus the batch size, since the batch needs to attend to the full
window size. This means that we have greater than a window size
stored while processing the batch.

When the next batch comes, we are currently only looking at the
sequences in the incoming batch to slide the window forward.
However, we also need to clean up the other sequences that might
be occupying space in the batch processing buffer to ensure each
sequence is only using its window size of storage. Failure to do
this can result in "no kv cache slot found" errors.

Fixes: #10127
2025-10-08 16:43:14 -07:00
Jesse Gross
aa45f7ce27 discover: Disable flash attention for Jetson Xavier (CC 7.2)
GGML picks the wrong kernel and these systems fail with:
Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437:
ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu
was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
2025-10-08 09:56:15 -07:00
Daniel Hiltgen
4e5d862ec4 Integration test tuning (#12492)
Remove some flaky scenarios, and switch to chat for better reliability
2025-10-08 09:51:25 -07:00
Daniel Hiltgen
303be9304c docs: improve accuracy of LLM library docs (#12530) 2025-10-07 16:21:07 -07:00
Daniel Hiltgen
bd15eba4e4 Bring back escape valve for llm libraries and fix Jetpack6 crash (#12529)
* Bring back escape valve for llm libraries

If the new discovery logic picks the wrong library, this gives users the
ability to force a specific one using the same pattern as before. This
can also potentially speed up bootstrap discovery if one of the libraries
takes a long time to load and ultimately bind to no devices.  For example
unsupported AMD iGPUs can sometimes take a while to discover and rule out.

* Bypass extra discovery on jetpack systems

On at least Jetpack6, cuda_v12 appears to expose the iGPU, but crashes later on in
cublasInit so if we detect a Jetpack, short-circuit and use that variant.
2025-10-07 16:06:14 -07:00
Devon Rifkin
bc71278670 Merge pull request #12509 from ollama/drifkin/oai-compat-refactor
openai: refactor to split compat layer and middleware
2025-10-06 16:22:08 -07:00
Daniel Hiltgen
918231931c win: fix build script (#12513) 2025-10-06 14:46:45 -07:00
Daniel Hiltgen
04c1849878 discovery: prevent dup OLLAMA_LIBRARY_PATH (#12514)
This variable isn't currently documented or intended as something the user can
override, but if the user happens to set OLLAMA_LIBRARY_PATH we were doubling
this in the subprocess environment which will cause problems with the new
bootstrap discovery logic.
2025-10-06 14:36:44 -07:00
Devon Rifkin
2c2f4deaa9 openai: refactor to split compat layer and middleware
This makes the core openai compat layer independent of the middleware
that adapts it to our particular gin routes
2025-10-05 14:18:56 -07:00
Daniel Hiltgen
292767afb4 CI: fix win arm build (#12502)
Resolve subtle erroraction stickiness difference between x86 and arm builder setup
2025-10-04 11:46:45 -07:00
Daniel Hiltgen
ae5e0f0889 CI: replace clang compiler for windows (#12495) 2025-10-04 09:18:42 -07:00
Jesse Gross
19e6796eac llm: Support KV cache quantization with gpt-oss
With the new version of GGML in #12245, KV cache quantization
no longer causes a fallback to CPU.
2025-10-03 16:31:58 -07:00
Grace
33801c1597 Fixed Deepseek2 adding nil tensor error 2025-10-03 14:20:06 -07:00
Daniel Hiltgen
e4340667e3 Workaround broken NVIDIA iGPU free VRAM data (#12490)
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU
systems as they only return the kernel's actual free memory and ignore
buff/cache allocations which on a typical system will quickly fill up
most of the free system memory.  As a result, we incorrectly think
there's very little available for GPU allocations which is wrong.
2025-10-03 12:17:21 -07:00
Patrick Devine
2fa1e92a99 test: add template error test (#12489) 2025-10-03 12:05:34 -07:00
Daniel Hiltgen
07e36761c3 ci: place rocm windows in correct runner dir (#12487) 2025-10-03 07:28:40 -07:00
Daniel Hiltgen
c29fb007c0 CI: temporarily disable clang install (#12486)
This will likely yield builds that have problems with unicode characters
but at least we can start testing the release while we try to find an
alternate clang compiler for windows, or mingw ships a fixed version.
2025-10-02 20:31:18 -07:00
Daniel Hiltgen
730ed6e9e1 ci: fix windows build (#12485) 2025-10-02 19:16:01 -07:00
Daniel Hiltgen
dc06601677 ci: fix windows build (#12484) 2025-10-02 18:59:26 -07:00
Patrick Devine
1ed2881ef0 templates: fix crash in improperly defined templates (#12483) 2025-10-02 17:25:55 -07:00
Jesse Gross
0bda72892c llm: Enable flash attention by default for qwen3 and qwen3moe 2025-10-02 17:04:10 -07:00
Daniel Hiltgen
55ca827267 AMD: block running on unsupported gfx900/gfx906 (#12481) 2025-10-02 16:53:05 -07:00
Daniel Hiltgen
c68f367ef6 Update GGML to b6646 (#12245)
Notable EOLs with this change:
- MacOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
2025-10-02 14:47:10 -07:00
Jesse Gross
fdb109469f llm: Allow overriding flash attention setting
As we automatically enable flash attention for more models, there
are likely some cases where we get it wrong. This allows setting
OLLAMA_FLASH_ATTENTION=0 to disable it, even for models that usually
have flash attention.
2025-10-02 12:07:20 -07:00
Daniel Hiltgen
05a43e078a fix panic on bootstrapDevices (#12475)
Wrong index variable was used.
2025-10-01 17:39:29 -07:00
Daniel Hiltgen
bc8909fb38 Use runners for GPU discovery (#12090)
This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runner's capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
2025-10-01 15:12:32 -07:00
Devon Rifkin
6b50f2b9cd Merge pull request #12461 from ollama/drifkin/qwen3-coder-tweaks
qwen3-coder: fix tool definition type rendering
2025-09-30 19:47:44 -07:00
Michael Yang
35ac4eb12c fix keep alive
this reference to keep alive was missed in #12041 so chat has a
different behaviour than generate
2025-09-30 17:22:28 -07:00
Jesse Gross
3d0b1734c0 ggml: Preallocate CUDA pool memory
The GGML CUDA backend allocates additional memory for intermediate
results during calculation. This memory isn't currently allocated
during worst case graph reservation and therefore not included in
scheduling. This means that as these buffers potentially grow
with context length, we could crash.

This extends the memory allocation system down layer from the GGML
graph to the CUDA layer, preallocating the worst case memory there
as well.

Fixes #11753
2025-09-30 15:04:43 -07:00
Jesse Gross
efaee8c2d6 ggml: Backport scale kernel fixes
The GGML scale kernel uses signed 32-bit ints to represent
the number of elements in the tensor. For large images,
mistral-small3.2 overflows this, triggering CUDA errors due
to negative arguments.

Currently, this can happen when the user passes a large image
to mistral-small3.2. However, with upcoming changes to reserve
CUDA memory, it happens every time mistral-small is loaded as
we reserve using a worst case batch.

This patch is part of an upstream GGML commit and should be removed
after GGML is updated past 0a1b398 "ggml: add ops for WAN video model
(cuda && cpu) (#15669)".

Fixes #10388
2025-09-30 15:04:43 -07:00
Jesse Gross
734b57da0e ggml: Remove allocation status reporting
For each memory allocation we report the size of the (attempted)
allocation and whether it succeeded or failed. The latter status
reporting proved to be not that useful in practice as systems
such as Windows can automatically overflow from VRAM into RAM,
resulting in successful allocations even when there isn't
enough memory where we wanted.

As a result, this information is only used for debug logging,
which isn't worthwhile enough for the amount of code. It
also isn't fully accurate, as multiple allocations may result
in partial failures.
2025-09-30 15:04:43 -07:00
Devon Rifkin
83021fcf0f qwen3-coder: fix tool definition type rendering 2025-09-30 15:03:15 -07:00
Michael Yang
0469861d9d build: call find_package to instantiate library paths 2025-09-30 13:12:46 -07:00
羊撅撅
c47154c08d fix: correct condition for AMDGPU_TARGETS filtering logic (#12412) 2025-09-26 11:38:47 -07:00
Patrick Devine
b04e46da3e bugfix: restore the current runOptions if loading fails in the CLI (#12402)
There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
  1. it will not restore the current model settings if the current model is a thinking model; and
  2. it will crash if the current model is a non-thinking model

This bug fix saves the current runOptions and then restores them if the model load
doesn't happen. It also fixes the crash happening for non-thinking models.
2025-09-25 18:30:45 -07:00
Devon Rifkin
34efbbd3f0 Merge pull request #12417 from ollama/drifkin/qwen3-coder-unicode
parsers: fix unicode handling for qwen3-coder
2025-09-25 15:56:34 -07:00
Devon Rifkin
05ba4ca1f4 parsers: fix unicode handling for qwen3-coder
When trimming whitespace at the end of every chunk, we were iterating
backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the
multi-byte character ✅ (`"\u2705"`), which is represented in utf-8 as
the three bytes `0xE2 0x9C 0x85`. It happens that `0x85` is NEL, which
passes `unicode.IsSpace()`. Because we were iterating byte-by-byte, this
caused us to mistakenly slice in the middle of the rune, removing `0x85`
and leaving `0xE2 0x9C`, which beyond being the incorrect place to
slice, is not even a valid utf-8 character.

`trailingWhitespaceLen()` was modified to count from the end in a
rune-aware way. Tests with various multibyte unicode characters were
also added.
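
The difference between the two iteration strategies can be sketched in Go as follows. Both function names here are hypothetical and this is not the actual `trailingWhitespaceLen()` implementation, only a minimal illustration of byte-wise versus rune-wise counting using the standard `unicode` and `unicode/utf8` packages.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// trailingWhitespaceLenBytes is the buggy approach: it walks backwards one
// byte at a time and treats each byte as if it were a full code point.
func trailingWhitespaceLenBytes(s string) int {
	n := 0
	for i := len(s) - 1; i >= 0; i-- {
		if !unicode.IsSpace(rune(s[i])) {
			break
		}
		n++
	}
	return n
}

// trailingWhitespaceLenRunes decodes whole runes from the end of the string
// and returns the number of trailing whitespace bytes.
func trailingWhitespaceLenRunes(s string) int {
	total := 0
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		if !unicode.IsSpace(r) {
			break
		}
		total += size
		s = s[:len(s)-size]
	}
	return total
}

func main() {
	s := "ok \u2705" // ends with the bytes 0xE2 0x9C 0x85
	fmt.Println("byte-wise:", trailingWhitespaceLenBytes(s))
	fmt.Println("rune-wise:", trailingWhitespaceLenRunes(s))

	// Slicing with the byte-wise length cuts the rune in half.
	cut := s[:len(s)-trailingWhitespaceLenBytes(s)]
	fmt.Println("valid utf-8 after byte-wise trim:", utf8.ValidString(cut))
}
```

The byte-wise version mistakes the final `0x85` for NEL, so trimming by its result leaves an invalid utf-8 sequence, while the rune-wise version leaves the character intact.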


Fixes: #12414
2025-09-25 15:47:46 -07:00
Patrick Devine
5a56ff3cf0 cli: add device signin flow when doing ollama push (#12405) 2025-09-25 15:04:43 -07:00
Gabe Goodhart
2fba04b5fb tools: handle the case where a tool call sends "arguments" or "parameters" as a serialized json string (#12413) 2025-09-25 14:37:39 -07:00
Grace
fbd82ba5bb Grace/deepseek v3 migration (#12385)
* init deepseek model file

* temp removal of flash attention implementation

* shapes and proper, can make a pass

* query, key, value have good cosine similarity, but the max diff is a bit high

* Attention block is working! ** with eager for now, have not added the mask line

* Attention block is working! ** with eager for now, have not added the mask line

* working MoE at around 0.95 cosine sim

* added cosine similarity function

* Starting end to end structure

* Trying (and failing) to get rope to work, going to test full thing on tater

* running on tater36... just not the right outputs

* we have the right values for rope... but it's still not working?

* change Extrapolation Factor to 1

* removed adding residuals twice, removed normalization from shared expert, refactored Norms (Attention, MLP) to be outside the (Attention, MLP) blocks and in the Transformer block instead, add cache setLayer

* Temporary modelfiles for cpu

* change kpass intermediate step to kv, two layer outputs [0,1] look fine

* this calls for 16 chicken nuggets

* whoops

* cleaning up code

* delete stuff we dont need

* getting rid of debug statements for llama cpp

* working with long contexts

* fix long context view error

* reverting some changes I made for files that are not a part of the pr

* Added proper tokenizer for deepseek3

* clean up model and go test

* remove Modelfile

* not passing the tests

* whoops

* how to pass the ci tests

* resolving some of the comments

* rename

* linted and renamed deepseek3 -> deepseek2

* remove name go

* addressed changes - main change was adopting qwen3 naming scheme

* I cannot with linters

* clean up logs

* clean up logs

---------

Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain>
Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local>
Co-authored-by: graceguo <graceguo@tater36.localdomain>
2025-09-24 15:19:47 -07:00
Michael Yang
2e742544bf prefer ollama engine for qwen3moe (#12374) 2025-09-24 11:21:32 -07:00
Devon Rifkin
bbb195a6ff Merge pull request #12393 from ollama/drifkin/fix-built-ins
harmony: don't sanitize built-ins
2025-09-23 23:45:31 -07:00
Devon Rifkin
fd88cd7cb0 harmony: don't sanitize built-ins
In #11910 we started sanitizing function names, but we accidentally were
modifying built-ins like `browser.open` to `browser_open`. This was
removing the special prompt rendering for built-ins, but this wasn't
immediately apparent since the models seem to be reasonably good at
remembering the built-ins even when presented with these slightly
renamed versions. This fix prevents built-ins from ever being renamed.
2025-09-23 23:34:55 -07:00
Michael Yang
e1979c571a fix: leaf alt name (#12390)
a leaf node with an alternative name gets all its alternative names
added into the same branch rather than creating branches themselves
2025-09-23 17:50:53 -07:00
Michael Yang
bf78ed6ee9 add pre:, suf: to tags (#12274) 2025-09-23 16:08:57 -07:00
Michael Yang
a40d427bce multi-regexp pretokenizer (#12325) 2025-09-23 13:21:47 -07:00
Patrick Devine
64883e3c4c auth: fix problems with the ollama keypairs (#12373)
* auth: fix problems with the ollama keypairs

This change adds several fixes including:
  - reading in the pubkey files correctly
  - fixing the push unit test to create a keypair file in a temp directory
  - not return 500 errors for normal status error
2025-09-22 23:20:20 -07:00
Devon Rifkin
41efdd4048 Merge pull request #12339 from ollama/drifkin/harmony-refactor-to-builtin
harmony: remove special casing in routes.go
2025-09-22 13:13:40 -07:00
Daniel Hiltgen
c23e6f4cae tests: add single threaded history test (#12295)
* tests: add single threaded history test

Also tidies up some existing tests to handle more model output variation

* test: add support for testing specific architectures
2025-09-22 11:23:14 -07:00
jmorganca
af060eb250 docs: update cloud.md for cloud models 2025-09-22 13:09:17 -03:00
jmorganca
ae5c33008e docs: move turbo.md to cloud.md 2025-09-22 13:09:17 -03:00
Devon Rifkin
3677842ff1 Merge pull request #12358 from ollama/drifkin/qwen3-coder-ampersands
parsers: fix `&`s in qwen3coder parameter values
2025-09-20 12:40:33 -07:00
Devon Rifkin
242df70a75 parsers: fix &s in qwen3coder parameter values
In <https://github.com/ollama/ollama/issues/12357> we saw that the model
will output tool calls such as

```
<function=shell>
<parameter=command>
pwd && ls -la
</parameter>
</function>
```

We parse this using the approach of transforming into valid xml and then
using an xml parser. While we do transform the function and parameter
names, we weren't escaping the parameter values (which in this example
are invalid since `pwd && ls -la` contains unescaped ampersands).

This has been fixed by first transforming the tags in the same way, and
then walking the transformed string and escaping the text in between the
tags. This also fixes a case where `<` in the middle of a parameter
value would cause an xml parse failure.
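
The escaping step described above can be sketched in Go. This is a minimal illustration under stated assumptions, not the parser's actual code: the `tagRe` pattern, the `escapeBetweenTags` and `parseCall` helpers, and the `functionCall` struct are hypothetical, and the input is assumed to already have its tags transformed into `name="..."` attribute form.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches only the (already transformed) function and parameter tags.
// Anything outside these matches is treated as text. Hypothetical pattern.
var tagRe = regexp.MustCompile(`</?(?:function|parameter)(?: [^<>]*)?>`)

// escapeBetweenTags copies recognized tags through unchanged and
// xml-escapes all text that sits between them.
func escapeBetweenTags(s string) string {
	var b strings.Builder
	last := 0
	for _, loc := range tagRe.FindAllStringIndex(s, -1) {
		xml.EscapeText(&b, []byte(s[last:loc[0]]))
		b.WriteString(s[loc[0]:loc[1]])
		last = loc[1]
	}
	xml.EscapeText(&b, []byte(s[last:]))
	return b.String()
}

// functionCall is an illustrative shape for a single-parameter call.
type functionCall struct {
	Name      string `xml:"name,attr"`
	Parameter struct {
		Name  string `xml:"name,attr"`
		Value string `xml:",chardata"`
	} `xml:"parameter"`
}

// parseCall runs the standard xml decoder over the escaped text.
func parseCall(s string) (functionCall, error) {
	var fc functionCall
	err := xml.Unmarshal([]byte(s), &fc)
	return fc, err
}

func main() {
	raw := `<function name="shell"><parameter name="command">pwd && ls -la</parameter></function>`
	escaped := escapeBetweenTags(raw)
	fmt.Println(escaped)

	fc, err := parseCall(escaped)
	fmt.Println(fc.Name, fc.Parameter.Name, fc.Parameter.Value, err)
}
```

Because only known tags are passed through, a stray `<` inside a parameter value is escaped along with the ampersands rather than being read as the start of a tag.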

Fixes: #12357
2025-09-20 12:11:38 -07:00
Patrick Devine
dba39b2eee gemma: fix rope scaling for qat models (#12348)
* gemma: fix rope scaling for qat models

* gofumpt yourself
2025-09-19 15:04:40 -07:00
Michael Yang
9f3a37fd36 fix: model load for unsupported embedding models (#12311)
with #12181, there's now support for embeddings in ollama engine.
this is done by mutating the architecture and adding _embed when it
detects an embedding model. however this introduced a bug where if
an embedding model was run based on an existing ollama engine model
without an embedding implementation, e.g. llama4, it will pass the
initial arch support check but fail when actually loaded.

there are currently two entrypoints to creating a model. previously this
second entrypoint was necessary because calling model.New would also
load the model. since #11818, this is no longer the case so merge them
to reduce complexity
2025-09-18 16:11:08 -07:00
Michael Yang
7460259eb3 feat: qwen3 embed (#12301)
* cleanup

* use pooling.TypeNone

* pooling test

* qwen3 embed
2025-09-18 15:50:32 -07:00
Jeffrey Morgan
22ccdd74c2 server: add unauthorized error to remote chat handler (#12338) 2025-09-18 15:40:31 -07:00
Daniel Hiltgen
0c3d0e7533 build: avoid unbounded parallel builds (#12319)
With the addition of cuda v13, on a clean setup, the level of parallelism
was causing docker desktop to become overwhelmed and compilers
were crashing.  This limits to 8 parallel per build stage, with the ability
to override if you have many more cores available.
2025-09-18 14:57:01 -07:00
Devon Rifkin
e7f56ef3d8 harmony: remove special casing in routes.go
Now that we have a built-in parser abstraction, which was introduced in
<https://github.com/ollama/ollama/pull/12248>, we can modify our harmony
parser to match this and then get rid of nearly all of the
harmony-specific logic in routes.go. We do have a small amount of
code that turns the parser on by default if the architecture matches and
no other built-in parser was provided.

The built-in parser interface was modified in order to handle harmony's
prefill and tool name translation requirements.
2025-09-18 14:55:59 -07:00
Patrick Devine
eb0a5d4459 auth: check the permissions on the private key to see if it's readable (#12336) 2025-09-18 14:34:34 -07:00
Michael Yang
ceac416ec2 fix(integration): check truncated length (#12337) 2025-09-18 14:00:21 -07:00
Patrick Devine
2717dce6fe convert: convert bf16 vision weights to fp16 (#12324)
This change moves back to converting bf16 vision weights to fp16,
specifically if they start with the name "v." (such as v.blk.0.attn_k.weight).

This fixes a bug where converted images are failing because they are trying
to call `im2col` which doesn't have a bf16 kernel in ggml.
2025-09-17 17:43:17 -07:00
frob
9b8187b487 server: skip parsing initial <think> if provided in the prompt for /api/generate (#12289) 2025-09-17 16:39:04 -07:00
Patrick Devine
8b894933a7 engine: add remote proxy (#12307) 2025-09-17 14:40:53 -07:00
Daniel Hiltgen
9c5bf342bc fix: multi-cuda version skew (#12318)
Ensure that in a version skewed multi-cuda setup we use the lowest version for all GPUs
2025-09-17 13:05:09 -07:00
Michael Yang
564b558c92 fix(llama): other llama flavours (#12308)
* fix(llama): rope scale

* spm llama

* skip moe models

* cleanup
2025-09-17 12:12:21 -07:00
Michael Yang
a417ac97ee prefer ollama engine for qwen3 (#12310) 2025-09-17 09:48:21 -07:00
russcoss
05d53457af refactor: use the built-in max/min to simplify the code (#12280)
Signed-off-by: russcoss <russcoss@outlook.com>
2025-09-16 17:14:21 -07:00
Michael Yang
b225508c9b logutil: fix source field (#12279) 2025-09-16 16:18:07 -07:00
Devon Rifkin
fa1c987a29 Merge pull request #12248 from ollama/drifkin/qwen3-coder-parsing
add qwen3-coder tool support
2025-09-16 10:21:43 -07:00
Michael Yang
ad95d5b30b use split activations when possible (#12293)
* use ggml_*_split activations when possible

* forward qkv
2025-09-16 09:51:19 -07:00
Michael Yang
c253433d68 embed: cleanup (#12299)
* cleanup

* use pooling.TypeNone

* pooling test
2025-09-16 09:48:42 -07:00
Beshoy Girgis
a1cff89b30 fix: fix CUDA detection for older GPUs (#12300)
Prioritize GPU compute capability over driver version to ensure
Pascal GPUs (CC 6.1) use compatible CUDA v12 libraries instead of v13.
2025-09-16 07:47:06 -07:00
Daniel Hiltgen
93c64ea1b1 doc: show how to clear the cgo cache (#12298) 2025-09-15 15:45:35 -07:00
Michael Yang
3f6642f6fc model: implement bert in ollama engine (#9080)
* fix truncate

* s/SentencePieceModel/SentencePiece/

* bert

* wordpiece

* refactor pooling

* more tokenizers

* normalize embeddings
2025-09-15 15:35:59 -07:00
Michael Yang
6f7117145f batch: use tensors for outputs (#12185)
this cleans up the model interface slightly without too much impact in
other areas
2025-09-15 14:33:06 -07:00
Devon Rifkin
472feec2ff address comments 2025-09-15 11:46:25 -07:00
Devon Rifkin
47991940d4 add qwen3-coder tool support
The format qwen3-coder uses is relatively unique, both in rendering and
in parsing. To implement parsing, I wrote a custom parser in similar
style to harmony. For the rendering, I found that the logic would be
much more difficult to follow in a template, so I introduced the concept
of a built-in renderer that uses go code, rather than a template to
generate prompts.

I set us up for future built-in parsers and renderers by making it so
they can be specified in a Modelfile like so:

```
RENDERER "qwen3-coder"
PARSER "qwen3-coder"
```

These need to be provided explicitly because the architecture alone is
not enough to understand what format the model expects to receive, and
what format we expect it to output (e.g., qwen3-coder is `qwen3moe`,
which includes other qwen3-family models as well)

I haven't converted harmony to be one of these "built-ins" yet, since
some of it is in flux with the changes @ParthSareen has been making to
move harmony to the runner. It is likely that many other built-ins will
need to move to the runner as well, but I'm able to slightly defer that
decision since qwen3-coder doesn't have thinking (and therefore doesn't
need to be in the runner to make structured outputs work). I expect to
unify harmony with this approach very soon.

Whether a particular model supports tools or thinking was previously
inferred from templates, but without a template we now also use the
parser itself to declare what it supports. If we have future models that
re-use the same parsing format, but have different capabilities, we'll
want to parameterize them and give them different names to be specified
as a `PARSER`.

Misc changes:

- I worked on the renderer by diffing outputs from the reference
  implementation and ours. To make it easier to do this, I extended
  <https://github.com/ollama/ollama/pull/11875> to also support
  returning the prompt via the openai compat layer
2025-09-15 11:33:47 -07:00
jmorganca
92b96d54ef Revert "runner: move harmony to runner (#12052)"
This reverts commit 1a558f98e2.
2025-09-12 20:40:14 -03:00
jmorganca
9d56e63dbf Revert "runner: simplify parser entrypoints in runner (#12233)"
This reverts commit 8d6fffaead.
2025-09-12 20:40:14 -03:00
tc-mb
053092185e Fix image cannot be seen with slice image on llama engine
Ollama's recent update to the llama.cpp engine caused all models requiring a slice schema to not display images. As a result, the value of numTokens isn't always the length of the sliced image embed, but rather the end length of the schema. This causes the image embed to not be correctly included during slice processing.
2025-09-12 16:25:12 -07:00
Daniel Hiltgen
44a6792873 tests: tighten up a few flaky tests (#12271)
Sometimes the context test results are pure emojis.
Thanksgiving has too much variability, so swap for a more straightforward prompt.
2025-09-12 13:59:34 -07:00
Daniel Hiltgen
e4ce68311a cuda: remove compression for better compatibility (#12259)
This retains compatibility with driver 531 and up at the trade-off of space.
2025-09-12 07:59:14 -07:00
Jesse Gross
26214125e8 ollamarunner: Suppress stack trace during memory allocation
Allocation failures can be a normal part of new memory estimates, so
we shouldn't print a stack trace in this case.
2025-09-11 14:30:31 -07:00
Daniel Hiltgen
61fb912ca4 CI: fix windows cuda build (#12246)
* ci: adjust cuda component list

v13 has a different breakdown of the components required to build ollama

* review comments
2025-09-11 12:25:26 -07:00
Jesse Gross
aba1575315 llm: Don't try to load split vision models in the Ollama engine
If a model with a split vision projector is loaded in the Ollama
engine, the projector will be ignored and the model will hallucinate
a response. Instead, fall back and try to load the model in the llama
engine.
2025-09-11 11:41:55 -07:00
Jesse Gross
eb10390de9 llm: Enable new memory estimates by default
New memory estimates (see #11090 for more information) are now
enabled automatically for all models running on the Ollama engine,
improving both stability and performance through more accurate sizing
and allocation. Models running on the llama engine will continue to
use the original style of memory estimation.
2025-09-11 11:21:53 -07:00
Michael Yang
feb18cd710 feat: add dimensions field to embed requests (#12242)
* feat: add field to truncate embeddings

* add openai embeddings for dimensions
2025-09-11 10:36:10 -07:00
fengyuchuanshen
8a7e2055d2 cmd: use slices.Contains to simplify code (#12249) 2025-09-11 09:57:31 -07:00
Jesse Gross
29ddfc2cab ggml: Disable flash attention for gemma2
Our new engine implementation of gemma2 doesn't support flash
attention, which means that it also doesn't support KV cache
quantization. Currently, it is possible to turn these two on,
which will result in a crash.
2025-09-10 16:40:45 -07:00
Jesse Gross
71cb86af3e llm: Remove unneeded warning with flash attention enabled
If flash attention is enabled without KV cache quantization, we will
currently always get this warning:
level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
2025-09-10 16:40:45 -07:00
CarbonatedWater.org
5198956372 docs: add ollama-co2 to community integrations (#12230) 2025-09-10 16:37:10 -07:00
Daniel Hiltgen
17a023f34b Add v12 + v13 cuda support (#12000)
* Add support for upcoming NVIDIA Jetsons

The latest Jetsons with JetPack 7 are moving to an SBSA compatible model and
will not require building a JetPack specific variant.

* cuda: bring back dual versions

This adds back dual CUDA versions for our releases,
with v11 and v13 to cover a broad set of GPUs and
driver versions.

* win: break up native builds in build_windows.ps1

* v11 build working on windows and linux

* switch to cuda v12.8 not JIT

* Set CUDA compression to size

* enhance manual install linux docs
2025-09-10 12:05:18 -07:00
Parth Sareen
8d6fffaead runner: simplify parser entrypoints in runner (#12233) 2025-09-10 11:24:42 -07:00
Parth Sareen
20b53eaa72 tests: add tool calling integration test (#12232) 2025-09-09 14:01:11 -07:00
Daniel Hiltgen
6745182885 tests: reduce stress on CPU to 2 models (#12161)
* tests: reduce stress on CPU to 2 models

This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently

* tests: allow slow systems to pass on timeout

If a slow system is still streaming a response, and the response
will pass validation, don't fail just because the system is slow.

* test: unload embedding models more quickly
2025-09-09 09:32:15 -07:00
Kashyap Tanuku
f810ec741c readme: add Clueless to community integrations (#12188) 2025-09-08 21:31:29 -07:00
Jesse Gross
e119783e66 llm: Clamp batch size to context size
The context must always be able to store the current batch, so
if the user requests a small context then we should also shrink
the batch to match. This also fixes the TestLongInputContext
test on the new engine. (The old engine already has this behavior.)
2025-09-08 20:40:11 -07:00
Parth Sareen
1a558f98e2 runner: move harmony to runner (#12052) 2025-09-08 15:07:59 -07:00
Gabe Goodhart
7b91c9ce51 Hybrid and recurrent memory estimates (#12186)
This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models which are currently being badly overestimated because the default logic assumes full attention for all layers.

The logic for the sizing of the recurrent layers comes from the llama.cpp implementation

        ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
        ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-08 14:53:22 -07:00
Daniel Hiltgen
950d33aa30 docs: show how to debug nvidia init failures (#12216)
This debug setting can help troubleshoot obscure initialization failures.
2025-09-08 11:39:00 -07:00
Michael Yang
9714e38dd0 fix: nil pointer dereference if cache is nil (#12215) 2025-09-08 09:53:59 -07:00
frob
4378ae4ffa parser: don't check the file type of safetensors to prevent false negatives. (#12176)
* Don't check the file type of safetensor to prevent false negatives.

---------

Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-09-05 16:27:40 -07:00
Michael Yang
5994e8e8fd embedding gemma model (#12181)
* ollama: add embeddings
2025-09-04 09:09:07 -07:00
Michael Yang
b3e6120736 more logutil.Trace (#12177) 2025-09-03 17:24:39 -07:00
Michael Yang
fb92b61754 logutil: add Trace and TraceContext helpers (#12110) 2025-09-02 13:09:12 -07:00
Jesse Gross
8149a3c86e llm: Avoid underflow in free memory logging
If a GPU's free memory is less than the reserved amount, we might get
an underflow. Since it is an unsigned uint64, we print this as a large
number rather than the more correct 0. This only affects logging, the
actual layout code already handles this correctly.
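
The wraparound is easy to reproduce with plain `uint64` arithmetic. The sketch below uses hypothetical helper names and made-up byte counts. It is not the logging code itself, only the general pattern of clamping before subtracting.

```go
package main

import "fmt"

// naiveAvailable subtracts without checking, so it wraps around
// when reserved is larger than free.
func naiveAvailable(free, reserved uint64) uint64 {
	return free - reserved
}

// clampedAvailable returns 0 instead of wrapping around.
func clampedAvailable(free, reserved uint64) uint64 {
	if reserved > free {
		return 0
	}
	return free - reserved
}

func main() {
	var free, reserved uint64 = 100, 300
	fmt.Println("naive:", naiveAvailable(free, reserved))
	fmt.Println("clamped:", clampedAvailable(free, reserved))
}
```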

Bug #12138
2025-09-02 12:30:26 -07:00
Daniel Hiltgen
0cc90a8186 harden uncaught exception registration (#12120) 2025-09-02 09:43:55 -07:00
pxwanglu
e42300f25b ml: fix struct field name in comment (#12123) 2025-08-31 16:26:11 -07:00
alpha-nerd-nomyo
66e73809a1 readme: add NOMYO Router to community integrations (#12129) 2025-08-31 13:49:10 -07:00
Daniel Hiltgen
517807cdf2 perf: build graph for next batch async to keep GPU busy (#11863)
* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.
2025-08-29 14:20:28 -07:00
Daniel Hiltgen
ead4a9a1d0 Always filter devices (#12108)
* Always filter devices

Avoid crashing on unsupported AMD iGPUs

* Remove cuda device filtering

This interferes with mixed setups
2025-08-29 12:17:31 -07:00
ofrancon
4383a3ab7a readme: add Neuro SAN to community integrations (#12109) 2025-08-28 12:27:13 -07:00
Jesse Gross
9d97e6a9f1 ggml: Avoid allocating CUDA primary context on unused GPUs
The recent memory management changes caused all GPUs to be visible
to the runner, regardless of whether they are ultimately used. This
caused CUDA devices to allocate a primary context (~300 MB VRAM) on
each GPU, for each model. This is unnecessary, so we can both avoid
touching GPUs that we exclude in the early stage of allocation and
freeing the memory for any that we touch but don't use.

The issue will continue to exist for the old engine, since it touches
all devices during initialization.
2025-08-27 16:24:18 -07:00
Michael Yang
1081532430 fix keep alive (#12041) 2025-08-27 11:51:25 -07:00
Michael Yang
59412fbb43 convert(gptoss): mxfp4 to ggml layout to avoid jit conversion (#12018)
* convert: return bytes written

* ggml flavor mxfp4

* simplify jit conversion

* comment
2025-08-26 16:41:02 -07:00
Michael Yang
86834a2797 convert: fix tensor sorting (#12015)
there are two bugs here.

1. the check for a layer id is incorrect and should be >= 0 since layer
   0 is valid
2. if both tensors have a layer identifier, it will only compare the
   layer id, which will return 0 if the tensors are in the same layer.
   instead it should fall back to comparing the full tensor name
2025-08-26 13:57:46 -07:00
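The two fixes described in "convert: fix tensor sorting" can be sketched as a comparator. This is an illustration only: the "blk.N." naming pattern, the helper names layerID and compareTensors, and the choice to sort tensors without a layer id first are assumptions, not the converter's actual code.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
	"strconv"
	"strings"
)

// layerID returns N for names shaped like "blk.N.rest" and -1 otherwise.
// The "blk." prefix is an assumed naming convention for this sketch.
func layerID(name string) int {
	rest, ok := strings.CutPrefix(name, "blk.")
	if !ok {
		return -1
	}
	num, _, ok := strings.Cut(rest, ".")
	if !ok {
		return -1
	}
	id, err := strconv.Atoi(num)
	if err != nil || id < 0 {
		return -1
	}
	return id
}

// compareTensors orders by layer id first and falls back to the full name.
func compareTensors(a, b string) int {
	ai, bi := layerID(a), layerID(b)
	aHas, bHas := ai >= 0, bi >= 0 // fix 1: layer 0 is a valid id
	switch {
	case aHas && bHas:
		if c := cmp.Compare(ai, bi); c != 0 {
			return c
		}
		// fix 2: same layer, so fall through to the name comparison
	case aHas != bHas:
		// Assumption: tensors without a layer id sort first.
		if !aHas {
			return -1
		}
		return 1
	}
	return strings.Compare(a, b)
}

func main() {
	names := []string{"blk.10.a", "blk.0.b", "output.weight", "blk.0.a", "blk.2.a"}
	slices.SortFunc(names, compareTensors)
	fmt.Println(names)
}
```

Comparing the numeric id rather than the raw string keeps layer 2 ahead of layer 10, which a plain lexical sort would not.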
Michael Yang
85ccf7354d gptoss: enable flash attention by default (#11996) 2025-08-26 13:34:45 -07:00
Michael Yang
30fb7e19f8 remove extra field attr (#11205) 2025-08-25 09:58:16 -07:00
Jeffrey Morgan
d3450dd52e api: implement stringer for ToolFunctionParameters (#12038) 2025-08-22 16:26:48 -07:00
Jeffrey Morgan
4bcb04ad88 tools: avoid matching braces that are part of tool content (#12039) 2025-08-22 15:22:14 -07:00
Devon Rifkin
e3d5708754 Merge pull request #12021 from ollama/drifkin/thinking-double-emit
thinking: fix double emit when no opening tag
2025-08-22 12:01:37 -07:00
Jeffrey Morgan
4be4dc8717 server: skip parsing initial <think> if provided in the prompt (#12024) 2025-08-22 12:00:16 -07:00
zoupingshi
109d4fc3b4 chore: remove redundant words in comment (#12028)
Signed-off-by: zoupingshi <hangfachang@outlook.com>
2025-08-22 11:00:27 -07:00
Devon Rifkin
2cb0a580f3 thinking: fix double emit when no opening tag
The thinking parser will automatically transition to being a
pass-through if non-whitespace is seen before an opening tag. However,
we weren't clearing the buffer after the first non-whitespace input, so
in practice the first token would be emitted twice.

Added a test that demonstrated this, and then fixed the bug.
2025-08-21 21:03:12 -07:00
Parth Sareen
7cce5aac76 harmony: move harmony parsing into a package (#12016) 2025-08-21 13:56:22 -07:00
Michael Yang
4ae4f47b16 gpt-oss: convert from hugging face format (#11907) 2025-08-20 15:39:18 -07:00
Jesse Gross
073fa31df5 llm: Don't always evict models in CPU-only mode
With old memory estimates, it's currently impossible to load more
than one model at a time when no GPUs are available. This is because
the check for whether we need to evict a model looks to see if all
layers of the new model can be loaded onto GPUs, which is never true
if there are no GPUs. Before the memory management changes, there
was a special code path for CPU-only systems.

This problem does not exist with new memory estimates.

Fixes #11974
2025-08-20 14:31:02 -07:00
Michael Yang
91fc3c48e3 openai: remove reasoning as an api.Options (#11993) 2025-08-20 12:21:42 -07:00
Devon Rifkin
6de62664d9 Merge pull request #11973 from ollama/drifkin/bpe
model: fix boundary in bpe
2025-08-19 22:58:33 -07:00
Devon Rifkin
463a6caad8 model: add bpe roundtripping tests 2025-08-19 22:05:48 -07:00
Devon Rifkin
fc5fb09f51 model: fix boundary in bpe
0x007e is a tilde and was getting adjusted (+0x00a2) to 0x0120 in the
encode, but then in the decode it was getting adjusted down (-0x0100) to
0x0020. The boundary for the +0x00a2 case has been adjusted to fix this

Fixes: #11966
2025-08-19 18:34:49 -07:00
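To illustrate the boundary issue from "model: fix boundary in bpe", here is a minimal sketch of a byte-to-rune mapping in the style of GPT-2 byte-level BPE. The +0x00a2 and -0x0100 adjustments come from the commit message; the exact ranges, the 0x00ad special case, and the names encodeByte and decodeRune are assumptions based on the common GPT-2 table and are not copied from the model package.

```go
package main

import "fmt"

// encodeByte maps a raw byte to the rune used in the BPE vocabulary.
// Printable bytes such as '~' (0x7e) must map to themselves.
func encodeByte(b byte) rune {
	r := rune(b)
	switch {
	case r == 0x00ad:
		return 0x0143
	case r <= 0x0020:
		return r + 0x0100
	case r >= 0x007f && r <= 0x00a0: // starts at 0x7f, not 0x7e
		return r + 0x00a2
	}
	return r
}

// decodeRune is the inverse of encodeByte.
func decodeRune(r rune) byte {
	switch {
	case r == 0x0143:
		return 0x00ad
	case r >= 0x0100 && r <= 0x0120:
		return byte(r - 0x0100)
	case r >= 0x0121 && r <= 0x0142:
		return byte(r - 0x00a2)
	}
	return byte(r)
}

func main() {
	// Check that every byte survives an encode/decode round trip.
	for i := 0; i < 256; i++ {
		if decodeRune(encodeByte(byte(i))) != byte(i) {
			fmt.Printf("round trip failed for 0x%02x\n", i)
		}
	}
	fmt.Printf("%q encodes to %U\n", '~', encodeByte('~'))
}
```

If the lower bound of the +0x00a2 case were 0x007e, the tilde would encode to 0x0120 and decode to a space (0x0020), which is the failure the commit describes.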
Jesse Gross
05ccb17c6e kvcache: Use Cast instead of Copy for flash attention masks
Flash attention kernels require the mask of the KV cache be an F16
rather than an F32. We can use the GGML operation ggml_cast to do
this rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). This also makes performance
with flash attention better than without it, as expected.
2025-08-19 12:36:28 -07:00
Michael Yang
f804e8a460 disable output_all (#11959) 2025-08-18 17:45:40 -07:00
Kostis
9cfbffafc5 readme: add any-agent to community integrations (#11950) 2025-08-18 14:21:36 -07:00
Ruslan Suleymanov
470d580205 readme: add Andes to community integrations (#11952) 2025-08-18 14:20:28 -07:00
Devon Rifkin
b517bb1c19 Merge pull request #11910 from ollama/drifkin/harmony-fn-names
harmony: convert fn names to be valid ts identifiers
2025-08-18 14:17:47 -07:00
Jesse Gross
e3ade453a8 llm: Check for nil memory data before printing
We dump out our best memory estimate after we complete processing
for any reason, including errors. This is helpful for finding what
stopped us in error conditions, but in some cases we might not
have gotten even the first result yet.

Fixes #11957
2025-08-18 14:05:22 -07:00
Devon Rifkin
048bd4472a harmony: convert fn names to be valid ts identifiers
In <https://github.com/ollama/ollama/issues/11704#issuecomment-3177380197>
I noticed that hyphens in function names could possibly cause the model
to become confused. Later in that issue I found other explanations, but
at a minimum tool names with spaces in them are confusing to the model
because of the prompt format.

In this change I create a mapper that converts arbitrary tool names into
valid typescript identifiers. It's a little overly strict in that it
doesn't allow all unicode characters that might be valid in ts
identifiers, but it's still very permissive. Since mappings aren't
reversible, we must temporarily store this mapping in order to unmap it
if the model comes back with a call. We also handle the case where
multiple mappings collide into the same mapping and append a counter to
the end to make them unique
2025-08-18 14:05:16 -07:00
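The mapper described in "harmony: convert fn names to be valid ts identifiers" could look roughly like the following. This is a simplified sketch: the type nameMap, its methods, the ASCII-only character set, and the "_2" style of collision suffix are all hypothetical choices, not the harmony package's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// nameMap remembers both directions so a model's call can be unmapped.
type nameMap struct {
	toMapped   map[string]string // original name -> identifier
	toOriginal map[string]string // identifier -> original name
}

func newNameMap() *nameMap {
	return &nameMap{
		toMapped:   map[string]string{},
		toOriginal: map[string]string{},
	}
}

// sanitize replaces every character that is not allowed in this sketch's
// (deliberately strict, ASCII-only) identifier grammar with '_'.
func sanitize(name string) string {
	var sb strings.Builder
	for i, r := range name {
		letter := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
		digit := i > 0 && r >= '0' && r <= '9' // no leading digit
		if letter || digit || r == '_' || r == '$' {
			sb.WriteRune(r)
		} else {
			sb.WriteRune('_')
		}
	}
	if sb.Len() == 0 {
		return "_"
	}
	return sb.String()
}

// Map returns a unique identifier for original, appending a counter
// when two different names sanitize to the same identifier.
func (m *nameMap) Map(original string) string {
	if mapped, ok := m.toMapped[original]; ok {
		return mapped
	}
	base := sanitize(original)
	candidate := base
	for n := 2; ; n++ {
		if _, taken := m.toOriginal[candidate]; !taken {
			break
		}
		candidate = fmt.Sprintf("%s_%d", base, n)
	}
	m.toMapped[original] = candidate
	m.toOriginal[candidate] = original
	return candidate
}

// Unmap recovers the original tool name from a mapped identifier.
func (m *nameMap) Unmap(mapped string) (string, bool) {
	original, ok := m.toOriginal[mapped]
	return original, ok
}

func main() {
	m := newNameMap()
	fmt.Println(m.Map("get weather"), m.Map("get-weather"))
	fmt.Println(m.Unmap("get_weather_2"))
}
```

Because sanitizing is lossy, the stored mapping is the only way back to the original name, which is why it has to live for the duration of the request.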
Devon Rifkin
ec8bf5e6c5 Merge pull request #11875 from ollama/drifkin/print-template
server: add debug option for printing out prompt instead of calling model
2025-08-18 14:03:14 -07:00
Kostis
709bbb0b6d readme: add any-llm to community integrations (#11956) 2025-08-18 13:13:26 -07:00
Jody Doolittle
abeec240f9 readme: add Serene Pub to community integrations (#11946) 2025-08-18 13:12:41 -07:00
Michael Yang
df335aac09 gpt-oss: disable quantized kv cache (#11929) 2025-08-15 15:01:05 -07:00
Patrick Devine
026bc29237 cli: show the default context length env setting in online help (#11928) 2025-08-15 14:59:52 -07:00
Thomas Pelster
883d031268 docs: added missing comma in 'Ollama's Javascript library'' (#11915) 2025-08-15 14:45:01 -07:00
Daniel Hiltgen
5271ff8559 handle cgo flags in docker build (#11909)
Docker build requires build-args to be defined.  This ensures the release.yaml settings will be used.
2025-08-15 14:39:35 -07:00
Daniel Hiltgen
d6f7233a1c test: improve scheduler/concurrency stress tests (#11906)
* test: improve scheduler/concurrency stress tests

The scheduler test used to use approximate memory figures and would often
overshoot or undershoot a system's capacity, leading to flaky test results.
This should improve the reliability of this scenario by leveraging
ps output to determine exactly how many models it takes to
trigger thrashing.

The concurrency test is also refined to target num_parallel + 1 and handle
timeouts better.

With these refinements, TestMultiModelConcurrency was redundant

* test: add parallel generate with history

TestGenerateWithHistory will help verify caching and context
are properly handled while making requests

* test: focus embed tests on embedding models

remove non-embedding models from the embedding tests
2025-08-15 14:37:54 -07:00
Devon Rifkin
8de1da4767 server: add debug option for printing out prompt instead of calling model 2025-08-15 13:52:50 -07:00
Daniel Hiltgen
d925b5350c Revert "cuda: leverage JIT for smaller footprint (#11635)" (#11913)
This reverts commit dc5a645434.
2025-08-14 21:19:23 -07:00
Daniel Hiltgen
6eaf194b85 fix arm linux build when HWCAP2_SVE2 undefined (#11908) 2025-08-14 16:38:53 -07:00
Jesse Gross
d5a0d8d904 llm: New memory management
This changes the memory allocation strategy from upfront estimation to
tracking actual allocations done by the engine and reacting to that. The
goal is to avoid issues caused by both under-estimation (crashing) and
over-estimation (low performance due to under-utilized GPUs).

It is currently opt-in and can be enabled for models running on the
Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
cases is unchanged and will continue to use the existing estimates.
2025-08-14 15:24:01 -07:00
Michael Yang
ef7d26ba2c convert: skip reading into memory when possible (#11507)
if there's no transformation to the tensor and the input and output
types match, copy directly into the writer. also read from a bufio with
a 32K buffer
2025-08-14 15:03:57 -07:00
Michael Yang
1a19df1f3a update vendored llama.cpp and ggml (#11823)
* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with standard CGO layout where all
arch-specific source files are expected to live in the same directory as
the parent go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
GO layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
the preprocessor directives
2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
explicitly include the .c|.cpp files for the given architecture from the
nested directory

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use c++17 and include vendor for go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the central repo and point at the latest master

This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUInstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394

* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors as ggml's kernels expect them to be.

* convert mlp bf16 to f32

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-08-14 14:42:58 -07:00
Daniel Hiltgen
7ccfd97a93 doc: clarify both rocm and main bundle necessary (#11900)
Some users expect the rocm bundles to be self-sufficient, but they are designed to be additive.
2025-08-14 12:54:55 -07:00
Daniel Hiltgen
c385ca8672 test: add valid responses (#11902)
some of the new models need a few more valid responses to pass
2025-08-14 11:07:13 -07:00
Daniel Hiltgen
837379a94c discovery: fix cudart driver version (#11614)
We prefer the nvcuda library, which reports driver versions. When we
dropped cuda v11, we added a safety check for too-old drivers.  What
we missed was the cudart fallback discovery logic didn't have driver
version wired up.  This fixes cudart discovery to expose the driver
version as well so we no longer reject all GPUs if nvcuda didn't work.
2025-08-13 15:43:33 -07:00
Daniel Hiltgen
a24f90604f int: adjust a few models for integration tests (#11872) 2025-08-13 15:42:36 -07:00
Daniel Hiltgen
dc5a645434 cuda: leverage JIT for smaller footprint (#11635)
Prior to this change our official binaries contained both JIT PTX code and
the cubin binary code for our chosen compute capabilities. This change
switches to only compile the PTX code and rely on JIT at runtime for
generating the cubin specific to the user's GPU.  The cubins are cached
on the user's system, so they should only see a small lag on the very
first model load for a given Ollama release.  This also adds the first
generation of Blackwell GPUs so they aren't reliant on the Hopper PTX.

This change reduces the ggml-cuda.dll from 1.2G to 460M
2025-08-13 15:42:16 -07:00
youzichuan
bb71654ebe chore: fix some inconsistent function name in comment
Signed-off-by: youzichuan <youzichuan6@outlook.com>
2025-08-13 09:50:27 -07:00
Jesse Gross
a343ae53a4 ggml: Use ordinal IDs for AMD GPUs on Linux when UUID is unavailable
Some AMD GPUs do not provide UUIDs and report only "XX". In these
cases, we should use the ordinal ID as an alternate identifier.
This is the same as we always need to do on Windows for AMD.

In addition, this prints out the ID for each GPU when enumerating
them for easier debugging in the future.
2025-08-12 16:56:14 -07:00
Michael Yang
d0cf6c8281 fix(openai): handle reasoning_effort (#11868) 2025-08-12 11:02:01 -07:00
Jesse Gross
8f4ec9ab28 discover: CPU supports flash attention
We already run flash attention on CPUs in cases where we have
partial offloading but were disabling it if running on pure CPU,
 which is unnecessary.
2025-08-11 15:00:34 -07:00
Devon Rifkin
dbfd7bd027 Merge pull request #11861 from ollama/drifkin/fix-parsing-error
server: fix error when parsing bad harmony tool calls
2025-08-11 14:59:57 -07:00
Devon Rifkin
ee04dbba51 server: fix error when parsing bad harmony tool calls
Thanks @moll for reporting!

Fixes: #11781
2025-08-11 14:09:13 -07:00
Daniel Andersen
ea7657b54a sched: Add support for grouping GPUs (#10678)
This patch modifies Ollama to group GPUs so that the requested model fits in memory, instead of the former algorithm of using either one GPU or distributing over all available GPUs.

Benefits:
 - Lower amount of (PCIe-)bus communication between GPUs - especially when they are not very high speed
 - Allowing unallocated GPUs to get into power-saving mode.
 - Significantly reduce VRAM allocation when using more than 2 GPUs in a system
 - Due to the reduced memory allocation, you can run more models simultaneously.
2025-08-11 13:59:38 -07:00
Michael Vorburger
2c776f0780 CONTRIBUTING: Explicitly note docs:... as a good example (#11755) 2025-08-09 18:12:30 -07:00
Jesse Gross
79f6376f5b ggml: No-alloc mode
Callers can set a backend buffer type to be no-alloc, meaning that
it does not allocate memory for tensors or operations. This can
be used for calculating memory requirements. Tensors and graphs
must be recreated with no-alloc set to false before loading data.

Defaults to false for newly created backend buffer types.
2025-08-08 14:57:13 -07:00
Jesse Gross
756c78cfc7 ggml: Support closing backends
In order to iteratively find the best memory allocation, we need to
be able to free backend memory so we can try again.
2025-08-08 14:57:13 -07:00
Jesse Gross
d7f4f788d1 ggml: Use GGML's typedef'ed pointer types
For many backend data structures, GGML defines a typedef of a pointer
type and returns these from functions. In most cases, CGo understands
that these are interchangeable but some parts of Go (such as generics)
think they are two different types. We should prefer the form that
GGML uses.
2025-08-08 14:57:13 -07:00
Daniel Hiltgen
114c3f2265 tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Jesse Gross
f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with a limited amount of VRAM, there is a significant
performance hit to this increased context. In these cases, we
switch to the Ollama default of 4k
2025-08-07 14:23:55 -07:00
Devon Rifkin
aa9d889522 Merge pull request #11765 from ollama/drifkin/thinking-without-content
openai: always provide reasoning
2025-08-06 19:02:23 -07:00
Devon Rifkin
735c41f9ca openai: always provide reasoning
We were missing passing along thinking if content was nil (as opposed
to empty string)

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and passing an empty string are distinct
2025-08-06 18:54:20 -07:00
Devon Rifkin
223a619468 Merge pull request #11761 from ollama/drifkin/openai-tool-names
openai: when converting role=tool messages, propagate the tool name
2025-08-06 17:53:25 -07:00
Devon Rifkin
759dd78dd6 openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id` we inspect previous messages
and look for a matching tool call ID and grab its name

Issue: https://github.com/ollama/ollama/issues/11704
2025-08-06 17:00:24 -07:00
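The lookup described in "openai: when converting role=tool messages, propagate the tool name" can be sketched as follows. The message and toolCall types and the toolName helper are hypothetical and much simpler than the real OpenAI compatibility types; giving `name` priority over `tool_call_id` is also an assumption.

```go
package main

import "fmt"

// toolCall is a simplified view of a tool call made by the assistant.
type toolCall struct {
	ID   string
	Name string
}

// message is a simplified chat message for this sketch.
type message struct {
	Role       string
	Name       string // legacy `name` field, if the client sent it
	ToolCallID string // `tool_call_id` field, if the client sent it
	ToolCalls  []toolCall
}

// toolName resolves the tool name for a role=tool message.
func toolName(history []message, msg message) string {
	if msg.Name != "" {
		return msg.Name
	}
	if msg.ToolCallID == "" {
		return ""
	}
	// Search previous messages, most recent first, for a matching call ID.
	for i := len(history) - 1; i >= 0; i-- {
		for _, tc := range history[i].ToolCalls {
			if tc.ID == msg.ToolCallID {
				return tc.Name
			}
		}
	}
	return ""
}

func main() {
	history := []message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: []toolCall{{ID: "call_1", Name: "get_weather"}}},
	}
	fmt.Println(toolName(history, message{Role: "tool", ToolCallID: "call_1"}))
}
```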
Patrick Devine
44bc36d063 docs: update the faq (#11760) 2025-08-06 16:55:57 -07:00
Devon Rifkin
8f14e1f5f6 Merge pull request #11759 from ollama/drifkin/oai-tool-calling
openai: allow for content _and_ tool calls in the same message
2025-08-06 16:11:31 -07:00
Devon Rifkin
203c137810 openai: allow for content _and_ tool calls in the same message
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together

Fixes: https://github.com/ollama/ollama/issues/11704
2025-08-06 15:50:30 -07:00
Daniel Hiltgen
fa8be9e35c clean up debugging (#11756) 2025-08-06 13:31:22 -07:00
Gao feng
8a75e9ee15 Update downloading to pulling in api.md (#11170)
update api.md to make it consistent with the code.
https://github.com/ollama/ollama/blob/main/server/download.go#L447
2025-08-06 11:33:09 -07:00
Parth Sareen
4742e12c23 docs: update turbo model name (#11707) 2025-08-05 17:29:08 -07:00
Devon Rifkin
2d06977ade Merge pull request #11705 from ollama/drifkin/fn-schema
tools: support anyOf types
2025-08-05 17:02:42 -07:00
Devon Rifkin
30f8a68c4c tools: support anyOf types
afaik gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
was assuming that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of json schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., zed's
ollama integration w/ gpt-oss). Probably the most urgent is proper array
support
2025-08-05 16:46:24 -07:00
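The recursive handling that "tools: support anyOf types" describes might be sketched like this. The schema struct and the tsType function are hypothetical simplifications and are not the `toTypeScriptType()` implementation; the type names emitted for arrays and objects are placeholder choices.

```go
package main

import (
	"fmt"
	"strings"
)

// schema is a tiny subset of JSON Schema: either a `type` or an `anyOf`.
type schema struct {
	Type  string
	AnyOf []schema
}

// tsType renders a schema as a TypeScript type expression.
// anyOf members are rendered recursively and joined into a union.
func tsType(s schema) string {
	if len(s.AnyOf) > 0 {
		parts := make([]string, 0, len(s.AnyOf))
		for _, sub := range s.AnyOf {
			parts = append(parts, tsType(sub))
		}
		return strings.Join(parts, " | ")
	}
	switch s.Type {
	case "string", "boolean", "null":
		return s.Type
	case "number", "integer":
		return "number"
	case "array":
		return "any[]" // element types are not modeled in this sketch
	case "object":
		return "Record<string, any>"
	}
	return "any"
}

func main() {
	optionalString := schema{AnyOf: []schema{{Type: "string"}, {Type: "null"}}}
	fmt.Println(tsType(optionalString))
}
```

Doing the recursion in Go rather than in the template keeps the template to a single function call per parameter.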
Daniel Hiltgen
e378e33421 win: static link msvc libs (#11612)
This should help reduce the runtime dependencies on windows.
2025-08-05 16:10:42 -07:00
Michael Yang
fcec04bf42 gptoss: fix memory calc (#11700) 2025-08-05 15:56:12 -07:00
Jeffrey Morgan
ee92ca3e1d docs: add docs for Ollama Turbo (#11687) 2025-08-05 13:09:10 -07:00
Jesse Gross
8253ad4d2b ggml: Prevent kv cache quantization on gpt-oss
KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.

The model definition does not call flash attention, so it works
regardless of the setting but the cache will pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
2025-08-05 13:04:03 -07:00
Michael Yang
fa7776fd24 gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors.  Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-08-05 12:21:16 -07:00
Jesse Gross
0d38b66502 kvcache: Log contents of cache when unable to find a slot
There is a bug when using sliding window attention where we run
out of KV cache slots. This is likely due to not correctly removing
all of the entries as they slide out of range. This adds additional
logging when this occurs to track down the source.

Bug #10127
2025-08-04 16:59:29 -07:00
Jesse Gross
4183bb0574 kvcache: Enable SWA to retain additional entries
Models that use sliding window attention can only resume a sequence
from the cache if it falls within the saved windows. This works well
if the next message picks up where the old one left off. However, it
generally prevents a partial prefix match unless the entire conversation
falls within the sliding window.

This can be a problem with reasoning models where the traces are
supposed to be removed from future messages, forcing the entire
history to be re-evaluated.

This change allows models to specify that a larger amount of the
history be retained in memory, to allow more partial resumption.
It still respects the window that the model was trained on for
token generation.
2025-07-31 14:48:01 -07:00
Sajal Kulshreshtha
ff89ba90bc fixing broken AMD driver link (#11579) 2025-07-30 12:02:54 -07:00
Daniel Hiltgen
6dcc5dfb9c Revert "CI: switch back to x86 macos builder" (#11588)
This reverts commit 9d071e6089.
2025-07-30 08:56:01 -07:00
Daniel Hiltgen
25911a6e6b mac: disable bf16 on unsupported OS versions (#11585)
Support for bf16 was added in MacOS v14+ and attempting to enable
on older versions causes runtime failures.
2025-07-30 08:50:54 -07:00
762 changed files with 232794 additions and 61642 deletions


@@ -23,7 +23,7 @@ jobs:
echo GOFLAGS="'-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=${GITHUB_REF_NAME#v}\" \"-X=github.com/ollama/ollama/server.mode=release\"'" >>$GITHUB_OUTPUT
darwin-build:
runs-on: macos-13
runs-on: macos-13-xlarge
environment: release
needs: setup-environment
strategy:
@@ -65,14 +65,36 @@ jobs:
arch: amd64
preset: 'CUDA 12'
install: https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_571.96_windows.exe
cuda-components:
- '"cudart"'
- '"nvcc"'
- '"cublas"'
- '"cublas_dev"'
cuda-version: '12.8'
flags: ''
runner_dir: 'cuda_v12'
- os: windows
arch: amd64
preset: 'CUDA 13'
install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
cuda-components:
- '"cudart"'
- '"nvcc"'
- '"cublas"'
- '"cublas_dev"'
- '"crt"'
- '"nvvm"'
- '"nvptxcompiler"'
cuda-version: '13.0'
flags: ''
runner_dir: 'cuda_v13'
- os: windows
arch: amd64
preset: 'ROCm 6'
install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-WinSvr2022-For-HIP.exe
rocm-version: '6.2'
flags: '-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma" -DCMAKE_CXX_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma"'
runner_dir: 'rocm'
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
environment: release
env:
@@ -96,7 +118,7 @@ jobs:
$ErrorActionPreference = "Stop"
if ("${{ steps.cache-install.outputs.cache-hit }}" -ne 'true') {
Invoke-WebRequest -Uri "${{ matrix.install }}" -OutFile "install.exe"
$subpackages = @("cudart", "nvcc", "cublas", "cublas_dev") | Foreach-Object {"${_}_${{ matrix.cuda-version }}"}
$subpackages = @(${{ join(matrix.cuda-components, ', ') }}) | Foreach-Object {"${_}_${{ matrix.cuda-version }}"}
Start-Process -FilePath .\install.exe -ArgumentList (@("-s") + $subpackages) -NoNewWindow -Wait
}
@@ -138,9 +160,10 @@ jobs:
run: |
Import-Module 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\Tools\Microsoft.VisualStudio.DevShell.dll'
Enter-VsDevShell -VsInstallPath 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise' -SkipAutomaticLocation -DevCmdArguments '-arch=x64 -no_logo'
cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }}
cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }} -DOLLAMA_RUNNER_DIR="${{ matrix.runner_dir }}"
cmake --build --parallel --preset "${{ matrix.preset }}"
cmake --install build --component "${{ startsWith(matrix.preset, 'CUDA ') && 'CUDA' || startsWith(matrix.preset, 'ROCm ') && 'HIP' || 'CPU' }}" --strip --parallel 8
Remove-Item -Path dist\lib\ollama\rocm\rocblas\library\*gfx906* -ErrorAction SilentlyContinue
env:
CMAKE_GENERATOR: Ninja
- uses: actions/upload-artifact@v4
@@ -153,19 +176,19 @@ jobs:
matrix:
os: [windows]
arch: [amd64, arm64]
include:
- os: windows
arch: amd64
llvmarch: x86_64
- os: windows
arch: arm64
llvmarch: aarch64
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
environment: release
needs: [setup-environment]
env:
GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
steps:
- name: Install AMD64 system dependencies
if: matrix.arch == 'amd64'
run: |
$ErrorActionPreference = "Stop"
Start-Process "C:\msys64\usr\bin\pacman.exe" -ArgumentList @("-S", "--noconfirm", "mingw-w64-clang-x86_64-gcc-compat", "mingw-w64-clang-x86_64-clang") -NoNewWindow -Wait
echo "C:\msys64\usr\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "C:\msys64\clang64\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
- name: Install ARM64 system dependencies
if: matrix.arch == 'arm64'
run: |
@@ -177,15 +200,29 @@ jobs:
choco install -y --no-progress git gzip
echo "C:\Program Files\Git\cmd" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-aarch64.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt-aarch64.zip"
Expand-Archive -Path ${{ runner.temp }}\llvm-mingw-ucrt-aarch64.zip -DestinationPath "C:\Program Files\"
$installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt-aarch64").path
echo $installPath\bin | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
- name: Install clang and gcc-compat
run: |
$ErrorActionPreference = "Stop"
Set-ExecutionPolicy Bypass -Scope Process -Force
Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-${{ matrix.llvmarch }}.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
Expand-Archive -Path ${{ runner.temp }}\llvm-mingw-ucrt.zip -DestinationPath "C:\Program Files\"
$installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt*").path
echo "$installPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Verify gcc is actually clang
run: |
$ErrorActionPreference='Continue'
$version=& gcc -v 2>&1
$version=$version -join "`n"
echo "gcc is $version"
if ($version -notmatch 'clang') {
echo "ERROR: GCC must be clang for proper utf16 handling"
exit 1
}
$ErrorActionPreference='Stop'
- run: |
go build -o dist/${{ matrix.os }}-${{ matrix.arch }}/ .
- uses: actions/upload-artifact@v4
@@ -200,13 +237,13 @@ jobs:
include:
- os: linux
arch: amd64
target: archive
target: archive_novulkan
- os: linux
arch: amd64
target: rocm
- os: linux
arch: arm64
target: archive
target: archive_novulkan
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
environment: release
needs: setup-environment
@@ -232,7 +269,7 @@ jobs:
case "$COMPONENT" in
bin/ollama) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/*.so*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/cuda_sbsa) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/cuda_v*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/cuda_jetpack5) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-jetpack5.tar.in ;;
lib/ollama/cuda_jetpack6) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-jetpack6.tar.in ;;
lib/ollama/rocm) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-rocm.tar.in ;;
@@ -262,12 +299,14 @@ jobs:
include:
- os: linux
arch: arm64
target: novulkan
build-args: |
CGO_CFLAGS
CGO_CXXFLAGS
GOFLAGS
- os: linux
arch: amd64
target: novulkan
build-args: |
CGO_CFLAGS
CGO_CXXFLAGS
@@ -280,6 +319,14 @@ jobs:
CGO_CXXFLAGS
GOFLAGS
FLAVOR=rocm
- os: linux
arch: amd64
suffix: '-vulkan'
target: default
build-args: |
CGO_CFLAGS
CGO_CXXFLAGS
GOFLAGS
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
environment: release
needs: setup-environment
@@ -297,6 +344,7 @@ jobs:
with:
context: .
platforms: ${{ matrix.os }}/${{ matrix.arch }}
target: ${{ matrix.target }}
build-args: ${{ matrix.build-args }}
outputs: type=image,name=${{ vars.DOCKER_REPO }},push-by-digest=true,name-canonical=true,push=true
cache-from: type=registry,ref=${{ vars.DOCKER_REPO }}:latest

@@ -46,12 +46,18 @@ jobs:
include:
- preset: CPU
- preset: CUDA
container: nvidia/cuda:12.8.1-devel-ubuntu22.04
container: nvidia/cuda:13.0.0-devel-ubuntu22.04
flags: '-DCMAKE_CUDA_ARCHITECTURES=87'
- preset: ROCm
container: rocm/dev-ubuntu-22.04:6.1.2
extra-packages: rocm-libs
flags: '-DAMDGPU_TARGETS=gfx1010 -DCMAKE_PREFIX_PATH=/opt/rocm'
- preset: Vulkan
container: ubuntu:22.04
extra-packages: >
mesa-vulkan-drivers vulkan-tools
libvulkan1 libvulkan-dev
vulkan-sdk cmake ccache g++ make
runs-on: linux
container: ${{ matrix.container }}
steps:
@@ -59,7 +65,19 @@ jobs:
- run: |
[ -n "${{ matrix.container }}" ] || sudo=sudo
$sudo apt-get update
# Add LunarG Vulkan SDK apt repo for Ubuntu 22.04
if [ "${{ matrix.preset }}" = "Vulkan" ]; then
$sudo apt-get install -y --no-install-recommends wget gnupg ca-certificates software-properties-common
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | $sudo gpg --dearmor -o /usr/share/keyrings/lunarg-archive-keyring.gpg
# Use signed-by to bind the repo to the installed keyring to avoid NO_PUBKEY
echo "deb [signed-by=/usr/share/keyrings/lunarg-archive-keyring.gpg] https://packages.lunarg.com/vulkan/1.4.313 jammy main" | $sudo tee /etc/apt/sources.list.d/lunarg-vulkan-1.4.313-jammy.list > /dev/null
$sudo apt-get update
fi
$sudo apt-get install -y cmake ccache ${{ matrix.extra-packages }}
# Export VULKAN_SDK if provided by LunarG package (defensive)
if [ -d "/usr/lib/x86_64-linux-gnu/vulkan" ] && [ "${{ matrix.preset }}" = "Vulkan" ]; then
echo "VULKAN_SDK=/usr" >> $GITHUB_ENV
fi
env:
DEBIAN_FRONTEND: noninteractive
- uses: actions/cache@v4
@@ -78,23 +96,35 @@ jobs:
include:
- preset: CPU
- preset: CUDA
install: https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_571.96_windows.exe
install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
flags: '-DCMAKE_CUDA_ARCHITECTURES=80'
cuda-components:
- '"cudart"'
- '"nvcc"'
- '"cublas"'
- '"cublas_dev"'
- '"crt"'
- '"nvvm"'
- '"nvptxcompiler"'
cuda-version: '13.0'
- preset: ROCm
install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-WinSvr2022-For-HIP.exe
flags: '-DAMDGPU_TARGETS=gfx1010 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma" -DCMAKE_CXX_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma"'
- preset: Vulkan
install: https://sdk.lunarg.com/sdk/download/1.4.321.1/windows/vulkansdk-windows-X64-1.4.321.1.exe
runs-on: windows
steps:
- run: |
choco install -y --no-progress ccache ninja
ccache -o cache_dir=${{ github.workspace }}\.ccache
- if: matrix.preset == 'CUDA' || matrix.preset == 'ROCm'
- if: matrix.preset == 'CUDA' || matrix.preset == 'ROCm' || matrix.preset == 'Vulkan'
id: cache-install
uses: actions/cache/restore@v4
with:
path: |
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA
C:\Program Files\AMD\ROCm
C:\VulkanSDK
key: ${{ matrix.install }}
- if: matrix.preset == 'CUDA'
name: Install CUDA ${{ matrix.cuda-version }}
@@ -102,7 +132,8 @@ jobs:
$ErrorActionPreference = "Stop"
if ("${{ steps.cache-install.outputs.cache-hit }}" -ne 'true') {
Invoke-WebRequest -Uri "${{ matrix.install }}" -OutFile "install.exe"
Start-Process -FilePath .\install.exe -ArgumentList (@("-s", "cudart_12.8", "nvcc_12.8", "cublas_12.8", "cublas_dev_12.8")) -NoNewWindow -Wait
$subpackages = @(${{ join(matrix.cuda-components, ', ') }}) | Foreach-Object {"${_}_${{ matrix.cuda-version }}"}
Start-Process -FilePath .\install.exe -ArgumentList (@("-s") + $subpackages) -NoNewWindow -Wait
}
$cudaPath = (Resolve-Path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\*").path
@@ -123,6 +154,18 @@ jobs:
echo "HIPCXX=$hipPath\bin\clang++.exe" | Out-File -FilePath $env:GITHUB_ENV -Append
echo "HIP_PLATFORM=amd" | Out-File -FilePath $env:GITHUB_ENV -Append
echo "CMAKE_PREFIX_PATH=$hipPath" | Out-File -FilePath $env:GITHUB_ENV -Append
- if: matrix.preset == 'Vulkan'
name: Install Vulkan ${{ matrix.rocm-version }}
run: |
$ErrorActionPreference = "Stop"
if ("${{ steps.cache-install.outputs.cache-hit }}" -ne 'true') {
Invoke-WebRequest -Uri "${{ matrix.install }}" -OutFile "install.exe"
Start-Process -FilePath .\install.exe -ArgumentList "-c","--am","--al","in" -NoNewWindow -Wait
}
$vulkanPath = (Resolve-Path "C:\VulkanSDK\*").path
echo "$vulkanPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "VULKAN_SDK=$vulkanPath" >> $env:GITHUB_ENV
- if: ${{ !cancelled() && steps.cache-install.outputs.cache-hit != 'true' }}
uses: actions/cache/save@v4
with:

.gitignore
@@ -6,6 +6,7 @@
dist
build
.cache
.gocache
*.exe
.idea
test_data

@@ -3,6 +3,7 @@ cmake_minimum_required(VERSION 3.21)
project(Ollama C CXX)
include(CheckLanguage)
include(GNUInstallDirs)
find_package(Threads REQUIRED)
@@ -37,7 +38,7 @@ if (CMAKE_OSX_ARCHITECTURES MATCHES "x86_64")
endif()
set(OLLAMA_BUILD_DIR ${CMAKE_BINARY_DIR}/lib/ollama)
set(OLLAMA_INSTALL_DIR ${CMAKE_INSTALL_PREFIX}/lib/ollama)
set(OLLAMA_INSTALL_DIR ${CMAKE_INSTALL_PREFIX}/lib/ollama/${OLLAMA_RUNNER_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${OLLAMA_BUILD_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_DEBUG ${OLLAMA_BUILD_DIR})
@@ -51,7 +52,7 @@ include_directories(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/include
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cpu)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cpu/amx)
add_compile_definitions(NDEBUG)
add_compile_definitions(NDEBUG GGML_VERSION=0x0 GGML_COMMIT=0x0)
set(GGML_CPU ON)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src)
@@ -80,7 +81,7 @@ if(CMAKE_CUDA_COMPILER)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cuda)
install(TARGETS ggml-cuda
RUNTIME_DEPENDENCIES
DIRECTORIES ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_LIBRARY_DIR}
DIRECTORIES ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR}
PRE_INCLUDE_REGEXES cublas cublasLt cudart
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CUDA
@@ -88,23 +89,26 @@ if(CMAKE_CUDA_COMPILER)
)
endif()
set(WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX "^gfx(906|908|90a|1200|1201):xnack[+-]$"
set(WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX "^gfx(908|90a|1200|1201):xnack[+-]$"
CACHE STRING
"Regular expression describing AMDGPU_TARGETS not supported on Windows. Override to force building these targets. Default \"^gfx(906|908|90a|1200|1201):xnack[+-]$\"."
"Regular expression describing AMDGPU_TARGETS not supported on Windows. Override to force building these targets. Default \"^gfx(908|90a|1200|1201):xnack[+-]$\"."
)
check_language(HIP)
if(CMAKE_HIP_COMPILER)
set(HIP_PLATFORM "amd")
find_package(hip REQUIRED)
if(NOT AMDGPU_TARGETS)
list(FILTER AMDGPU_TARGETS INCLUDE REGEX "^gfx(900|94[012]|101[02]|1030|110[012]|120[01])$")
elseif(WIN32 AND WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX)
find_package(hip REQUIRED)
list(FILTER AMDGPU_TARGETS INCLUDE REGEX "^gfx(94[012]|101[02]|1030|110[012]|120[01])$")
endif()
if(WIN32 AND WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX)
list(FILTER AMDGPU_TARGETS EXCLUDE REGEX ${WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX})
endif()
if(AMDGPU_TARGETS)
find_package(hip REQUIRED)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-hip)
if (WIN32)
@@ -113,7 +117,6 @@ if(CMAKE_HIP_COMPILER)
target_compile_definitions(ggml-hip PRIVATE GGML_HIP_NO_VMM)
set(OLLAMA_HIP_INSTALL_DIR ${OLLAMA_INSTALL_DIR}/rocm)
install(TARGETS ggml-hip
RUNTIME_DEPENDENCY_SET rocm
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
@@ -124,15 +127,27 @@ if(CMAKE_HIP_COMPILER)
PRE_INCLUDE_REGEXES hipblas rocblas amdhip64 rocsolver amd_comgr hsa-runtime64 rocsparse tinfo rocprofiler-register drm drm_amdgpu numa elf
PRE_EXCLUDE_REGEXES ".*"
POST_EXCLUDE_REGEXES "system32"
RUNTIME DESTINATION ${OLLAMA_HIP_INSTALL_DIR} COMPONENT HIP
LIBRARY DESTINATION ${OLLAMA_HIP_INSTALL_DIR} COMPONENT HIP
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
)
foreach(HIP_LIB_BIN_INSTALL_DIR IN ITEMS ${HIP_BIN_INSTALL_DIR} ${HIP_LIB_INSTALL_DIR})
if(EXISTS ${HIP_LIB_BIN_INSTALL_DIR}/rocblas)
install(DIRECTORY ${HIP_LIB_BIN_INSTALL_DIR}/rocblas DESTINATION ${OLLAMA_HIP_INSTALL_DIR} COMPONENT HIP)
install(DIRECTORY ${HIP_LIB_BIN_INSTALL_DIR}/rocblas DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP)
break()
endif()
endforeach()
endif()
endif()
find_package(Vulkan)
if(Vulkan_FOUND)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-vulkan)
install(TARGETS ggml-vulkan
RUNTIME_DEPENDENCIES
PRE_INCLUDE_REGEXES vulkan
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT Vulkan
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT Vulkan
)
endif()

@@ -6,7 +6,8 @@
"binaryDir": "${sourceDir}/build",
"installDir": "${sourceDir}/dist",
"cacheVariables": {
"CMAKE_BUILD_TYPE": "Release"
"CMAKE_BUILD_TYPE": "Release",
"CMAKE_MSVC_RUNTIME_LIBRARY": "MultiThreaded"
}
},
{
@@ -17,14 +18,30 @@
"name": "CUDA",
"inherits": [ "Default" ]
},
{
"name": "CUDA 11",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "50-virtual;60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-virtual;87-virtual;89-virtual;90-virtual",
"CMAKE_CUDA_FLAGS": "-Wno-deprecated-gpu-targets -t 2"
}
},
{
"name": "CUDA 12",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "50;60;61;70;75;80;86;87;89;90;90a;120",
"CMAKE_CUDA_ARCHITECTURES": "50;52;60;61;70;75;80;86;89;90;90a;120",
"CMAKE_CUDA_FLAGS": "-Wno-deprecated-gpu-targets -t 2"
}
},
{
"name": "CUDA 13",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;87-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual",
"CMAKE_CUDA_FLAGS": "-t 2"
}
},
{
"name": "JetPack 5",
"inherits": [ "CUDA" ],
@@ -51,8 +68,12 @@
"inherits": [ "ROCm" ],
"cacheVariables": {
"CMAKE_HIP_FLAGS": "-parallel-jobs=4",
"AMDGPU_TARGETS": "gfx900;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-"
"AMDGPU_TARGETS": "gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-"
}
},
{
"name": "Vulkan",
"inherits": [ "Default" ]
}
],
"buildPresets": [
@@ -71,11 +92,21 @@
"configurePreset": "CUDA",
"targets": [ "ggml-cuda" ]
},
{
"name": "CUDA 11",
"inherits": [ "CUDA" ],
"configurePreset": "CUDA 11"
},
{
"name": "CUDA 12",
"inherits": [ "CUDA" ],
"configurePreset": "CUDA 12"
},
{
"name": "CUDA 13",
"inherits": [ "CUDA" ],
"configurePreset": "CUDA 13"
},
{
"name": "JetPack 5",
"inherits": [ "CUDA" ],
@@ -95,6 +126,11 @@
"name": "ROCm 6",
"inherits": [ "ROCm" ],
"configurePreset": "ROCm 6"
},
{
"name": "Vulkan",
"targets": [ "ggml-vulkan" ],
"configurePreset": "Vulkan"
}
]
}

@@ -66,6 +66,7 @@ Examples:
llm/backend/mlx: support the llama architecture
CONTRIBUTING: provide clarity on good commit messages, and bad
docs: simplify manual installation with shorter curl commands
Bad Examples:

@@ -1,11 +1,13 @@
# vim: filetype=dockerfile
ARG FLAVOR=${TARGETARCH}
ARG PARALLEL=8
ARG ROCMVERSION=6.3.3
ARG JETPACK5VERSION=r35.4.1
ARG JETPACK6VERSION=r36.4.0
ARG CMAKEVERSION=3.31.2
ARG VULKANVERSION=1.4.321.1
# We require gcc v10 minimum. v10.3 has regressions, so the rockylinux 8.5 AppStream has the latest compatible version
FROM --platform=linux/amd64 rocm/dev-almalinux-8:${ROCMVERSION}-complete AS base-amd64
@@ -16,6 +18,16 @@ RUN yum install -y yum-utils \
&& dnf install -y ccache \
&& yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
ENV PATH=/opt/rh/gcc-toolset-10/root/usr/bin:$PATH
ARG VULKANVERSION
RUN wget https://sdk.lunarg.com/sdk/download/${VULKANVERSION}/linux/vulkansdk-linux-x86_64-${VULKANVERSION}.tar.xz -O /tmp/vulkansdk-linux-x86_64-${VULKANVERSION}.tar.xz \
&& tar xvf /tmp/vulkansdk-linux-x86_64-${VULKANVERSION}.tar.xz \
&& dnf -y install ninja-build \
&& ln -s /usr/bin/python3 /usr/bin/python \
&& /${VULKANVERSION}/vulkansdk -j 8 vulkan-headers \
&& /${VULKANVERSION}/vulkansdk -j 8 shaderc
RUN cp -r /${VULKANVERSION}/x86_64/include/* /usr/local/include/ \
&& cp -r /${VULKANVERSION}/x86_64/lib/* /usr/local/lib
ENV PATH=/${VULKANVERSION}/x86_64/bin:$PATH
FROM --platform=linux/arm64 almalinux:8 AS base-arm64
# install epel-release for ccache
@@ -34,26 +46,52 @@ ENV LDFLAGS=-s
FROM base AS cpu
RUN dnf install -y gcc-toolset-11-gcc gcc-toolset-11-gcc-c++
ENV PATH=/opt/rh/gcc-toolset-11/root/usr/bin:$PATH
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CPU' \
&& cmake --build --parallel --preset 'CPU' \
&& cmake --install build --component CPU --strip --parallel 8
&& cmake --build --parallel ${PARALLEL} --preset 'CPU' \
&& cmake --install build --component CPU --strip --parallel ${PARALLEL}
FROM base AS cuda-11
ARG CUDA11VERSION=11.8
RUN dnf install -y cuda-toolkit-${CUDA11VERSION//./-}
ENV PATH=/usr/local/cuda-11/bin:$PATH
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CUDA 11' -DOLLAMA_RUNNER_DIR="cuda_v11" \
&& cmake --build --parallel ${PARALLEL} --preset 'CUDA 11' \
&& cmake --install build --component CUDA --strip --parallel ${PARALLEL}
FROM base AS cuda-12
ARG CUDA12VERSION=12.8
RUN dnf install -y cuda-toolkit-${CUDA12VERSION//./-}
ENV PATH=/usr/local/cuda-12/bin:$PATH
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CUDA 12' \
&& cmake --build --parallel --preset 'CUDA 12' \
&& cmake --install build --component CUDA --strip --parallel 8
cmake --preset 'CUDA 12' -DOLLAMA_RUNNER_DIR="cuda_v12"\
&& cmake --build --parallel ${PARALLEL} --preset 'CUDA 12' \
&& cmake --install build --component CUDA --strip --parallel ${PARALLEL}
FROM base AS cuda-13
ARG CUDA13VERSION=13.0
RUN dnf install -y cuda-toolkit-${CUDA13VERSION//./-}
ENV PATH=/usr/local/cuda-13/bin:$PATH
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CUDA 13' -DOLLAMA_RUNNER_DIR="cuda_v13" \
&& cmake --build --parallel ${PARALLEL} --preset 'CUDA 13' \
&& cmake --install build --component CUDA --strip --parallel ${PARALLEL}
FROM base AS rocm-6
ENV PATH=/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/opt/rocm/bin:/opt/rocm/hcc/bin:$PATH
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'ROCm 6' \
&& cmake --build --parallel --preset 'ROCm 6' \
&& cmake --install build --component HIP --strip --parallel 8
cmake --preset 'ROCm 6' -DOLLAMA_RUNNER_DIR="rocm" \
&& cmake --build --parallel ${PARALLEL} --preset 'ROCm 6' \
&& cmake --install build --component HIP --strip --parallel ${PARALLEL}
RUN rm -f dist/lib/ollama/rocm/rocblas/library/*gfx90[06]*
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK5VERSION} AS jetpack-5
ARG CMAKEVERSION
@@ -61,10 +99,11 @@ RUN apt-get update && apt-get install -y curl ccache \
&& curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'JetPack 5' \
&& cmake --build --parallel --preset 'JetPack 5' \
&& cmake --install build --component CUDA --strip --parallel 8
cmake --preset 'JetPack 5' -DOLLAMA_RUNNER_DIR="cuda_jetpack5" \
&& cmake --build --parallel ${PARALLEL} --preset 'JetPack 5' \
&& cmake --install build --component CUDA --strip --parallel ${PARALLEL}
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK6VERSION} AS jetpack-6
ARG CMAKEVERSION
@@ -72,10 +111,18 @@ RUN apt-get update && apt-get install -y curl ccache \
&& curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
ARG PARALLEL
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'JetPack 6' \
&& cmake --build --parallel --preset 'JetPack 6' \
&& cmake --install build --component CUDA --strip --parallel 8
cmake --preset 'JetPack 6' -DOLLAMA_RUNNER_DIR="cuda_jetpack6" \
&& cmake --build --parallel ${PARALLEL} --preset 'JetPack 6' \
&& cmake --install build --component CUDA --strip --parallel ${PARALLEL}
FROM base AS vulkan
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'Vulkan' -DOLLAMA_RUNNER_DIR="vulkan" \
&& cmake --build --parallel --preset 'Vulkan' \
&& cmake --install build --component Vulkan --strip --parallel 8
FROM base AS build
WORKDIR /go/src/github.com/ollama/ollama
@@ -86,29 +133,62 @@ RUN go mod download
COPY . .
ARG GOFLAGS="'-ldflags=-w -s'"
ENV CGO_ENABLED=1
ARG CGO_CFLAGS
ARG CGO_CXXFLAGS
RUN --mount=type=cache,target=/root/.cache/go-build \
go build -trimpath -buildmode=pie -o /bin/ollama .
FROM --platform=linux/amd64 scratch AS amd64
COPY --from=cuda-12 dist/lib/ollama /lib/ollama
# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
COPY --from=cuda-13 dist/lib/ollama /lib/ollama/
COPY --from=vulkan dist/lib/ollama /lib/ollama/
FROM --platform=linux/arm64 scratch AS arm64
COPY --from=cuda-12 dist/lib/ollama /lib/ollama/cuda_sbsa
COPY --from=jetpack-5 dist/lib/ollama /lib/ollama/cuda_jetpack5
COPY --from=jetpack-6 dist/lib/ollama /lib/ollama/cuda_jetpack6
# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
COPY --from=cuda-13 dist/lib/ollama/ /lib/ollama/
COPY --from=jetpack-5 dist/lib/ollama/ /lib/ollama/
COPY --from=jetpack-6 dist/lib/ollama/ /lib/ollama/
FROM scratch AS rocm
COPY --from=rocm-6 dist/lib/ollama /lib/ollama
FROM ${FLAVOR} AS archive
ARG VULKANVERSION
COPY --from=cpu dist/lib/ollama /lib/ollama
COPY --from=build /bin/ollama /bin/ollama
FROM ubuntu:24.04
# Temporary opt-out stages for Vulkan
FROM --platform=linux/amd64 scratch AS amd64_novulkan
# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
COPY --from=cuda-13 dist/lib/ollama /lib/ollama/
FROM arm64 AS arm64_novulkan
FROM ${FLAVOR}_novulkan AS archive_novulkan
COPY --from=cpu dist/lib/ollama /lib/ollama
COPY --from=build /bin/ollama /bin/ollama
FROM ubuntu:24.04 AS novulkan
RUN apt-get update \
&& apt-get install -y ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
COPY --from=archive_novulkan /bin /usr/bin
ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
COPY --from=archive_novulkan /lib/ollama /usr/lib/ollama
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_VISIBLE_DEVICES=all
ENV OLLAMA_HOST=0.0.0.0:11434
EXPOSE 11434
ENTRYPOINT ["/bin/ollama"]
CMD ["serve"]
FROM ubuntu:24.04 AS default
RUN apt-get update \
&& apt-get install -y ca-certificates libvulkan1 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
COPY --from=archive /bin /usr/bin
ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
COPY --from=archive /lib/ollama /usr/lib/ollama

@@ -1,6 +1,6 @@
UPSTREAM=https://github.com/ggerganov/llama.cpp.git
UPSTREAM=https://github.com/ggml-org/llama.cpp.git
WORKDIR=llama/vendor
FETCH_HEAD=de4c07f93783a1a96456a44dc16b9db538ee1618
FETCH_HEAD=7049736b2dd9011bf819e298b844ebbc4b5afdc9
.PHONY: help
help:
@@ -12,7 +12,7 @@ help:
@echo " clean Clean local repository"
@echo
@echo "Example:"
@echo " make -f $(lastword $(MAKEFILE_LIST)) clean sync"
@echo " make -f $(lastword $(MAKEFILE_LIST)) clean apply-patches sync"
.PHONY: sync
sync: llama/build-info.cpp ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal
@@ -24,12 +24,12 @@ ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal: ml/backend/ggml/ggml
go generate ./$(@D)
.PHONY: llama/llama.cpp
llama/llama.cpp: llama/vendor/
rsync -arvzc -f "merge $@/.rsync-filter" $< $@
llama/llama.cpp: llama/vendor
rsync -arvzc --delete -f "include LICENSE" -f "merge $@/.rsync-filter" $(addprefix $<,/LICENSE /) $@
.PHONY: ml/backend/ggml/ggml
ml/backend/ggml/ggml: llama/vendor/ggml/
rsync -arvzc -f "merge $@/.rsync-filter" $< $@
ml/backend/ggml/ggml: llama/vendor
rsync -arvzc --delete -f "include LICENSE" -f "merge $@/.rsync-filter" $(addprefix $<,/LICENSE /ggml/) $@
PATCHES=$(wildcard llama/patches/*.patch)
PATCHED=$(join $(dir $(PATCHES)), $(addsuffix ed, $(addprefix ., $(notdir $(PATCHES)))))
@@ -39,7 +39,15 @@ PATCHED=$(join $(dir $(PATCHES)), $(addsuffix ed, $(addprefix ., $(notdir $(PATC
apply-patches: $(PATCHED)
llama/patches/.%.patched: llama/patches/%.patch
@if git -c user.name=nobody -c 'user.email=<>' -C $(WORKDIR) am -3 $(realpath $<); then touch $@; else git -C $(WORKDIR) am --abort; exit 1; fi
@if git -c user.name=nobody -c 'user.email=<>' -C $(WORKDIR) am -3 $(realpath $<); then \
touch $@; \
else \
echo "Patch failed. Resolve any conflicts then continue."; \
echo "1. Run 'git -C $(WORKDIR) am --continue'"; \
echo "2. Run 'make -f $(lastword $(MAKEFILE_LIST)) format-patches'"; \
echo "3. Run 'make -f $(lastword $(MAKEFILE_LIST)) clean apply-patches'"; \
exit 1; \
fi
.PHONY: checkout
checkout: $(WORKDIR)
@@ -60,4 +68,5 @@ format-patches: llama/patches
.PHONY: clean
clean: checkout
@git -C $(WORKDIR) am --abort || true
$(RM) llama/patches/.*.patched

@@ -411,6 +411,10 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [ollama launcher](https://github.com/NGC13009/ollama-launcher) (A launcher for Ollama, aiming to provide users with convenient functions such as ollama server launching, management, or configuration.)
- [ai-hub](https://github.com/Aj-Seven/ai-hub) (AI Hub supports multiple models via API keys and Chat support via Ollama API.)
- [Mayan EDMS](https://gitlab.com/mayan-edms/mayan-edms) (Open source document management system to organize, tag, search, and automate your files with powerful Ollama driven workflows.)
- [Serene Pub](https://github.com/doolijb/serene-pub) (Beginner friendly, open source AI Roleplaying App for Windows, Mac OS and Linux. Search, download and use models with Ollama all inside the app.)
- [Andes](https://github.com/aqerd/andes) (A Visual Studio Code extension that provides a local UI interface for Ollama models)
- [Clueless](https://github.com/KashyapTan/clueless) (Open Source & Local Cluely: A desktop application LLM assistant to help you talk to anything on your screen using locally served Ollama models. Also undetectable to screenshare)
- [ollama-co2](https://github.com/carbonatedWaterOrg/ollama-co2) (FastAPI web interface for monitoring and managing local and remote Ollama servers with real-time model monitoring and concurrent downloads)
### Cloud
@@ -457,6 +461,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [AWS-Strands-With-Ollama](https://github.com/rapidarchitect/ollama_strands) - AWS Strands Agents with Ollama Examples
- [ollama-multirun](https://github.com/attogram/ollama-multirun) - A bash shell script to run a single prompt against any or all of your locally installed ollama models, saving the output and performance statistics as easily navigable web pages. ([Demo](https://attogram.github.io/ai_test_zone/))
- [ollama-bash-toolshed](https://github.com/attogram/ollama-bash-toolshed) - Bash scripts to chat with tool using models. Add new tools to your shed with ease. Runs on Ollama.
- [VT Code](https://github.com/vinhnx/vtcode) - VT Code is a Rust-based terminal coding agent with semantic code intelligence via Tree-sitter. Ollama integration for running local/cloud models with configurable endpoints.
### Apple Vision Pro
@@ -537,6 +542,10 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [Nichey](https://github.com/goodreasonai/nichey) is a Python package for generating custom wikis for your research topic
- [Ollama for D](https://github.com/kassane/ollama-d)
- [OllamaPlusPlus](https://github.com/HardCodeDev777/OllamaPlusPlus) (Very simple C++ library for Ollama)
- [any-llm](https://github.com/mozilla-ai/any-llm) (A single interface to use different llm providers by [mozilla.ai](https://www.mozilla.ai/))
- [any-agent](https://github.com/mozilla-ai/any-agent) (A single interface to use and evaluate different agent frameworks by [mozilla.ai](https://www.mozilla.ai/))
- [Neuro SAN](https://github.com/cognizant-ai-lab/neuro-san-studio) (Data-driven multi-agent orchestration framework) with [example](https://github.com/cognizant-ai-lab/neuro-san-studio/blob/main/docs/user_guide.md#ollama)
- [achatbot-go](https://github.com/ai-bot-pro/achatbot-go) a multimodal(text/audio/image) chatbot.
### Mobile
@@ -597,6 +606,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [UnityCodeLama](https://github.com/HardCodeDev777/UnityCodeLama) (Unity Editor tool to analyze scripts via Ollama)
- [NativeMind](https://github.com/NativeMindBrowser/NativeMindExtension) (Private, on-device AI Assistant, no cloud dependencies)
- [GMAI - Gradle Managed AI](https://gmai.premex.se/) (Gradle plugin for automated Ollama lifecycle management during build phases)
- [NOMYO Router](https://github.com/nomyo-ai/nomyo-router) (A transparent Ollama proxy with model deployment aware routing which auto-manages multiple Ollama instances in a given network)
### Supported backends

@@ -42,26 +42,15 @@ type Client struct {
func checkError(resp *http.Response, body []byte) error {
if resp.StatusCode < http.StatusBadRequest {
if len(body) == 0 {
return nil
}
// streams can contain error message even with StatusOK
var errorResponse struct {
Error string `json:"error,omitempty"`
}
if err := json.Unmarshal(body, &errorResponse); err != nil {
return fmt.Errorf("unmarshal: %w", err)
}
if errorResponse.Error != "" {
return errors.New(errorResponse.Error)
}
return nil
}
if resp.StatusCode == http.StatusUnauthorized {
authError := AuthorizationError{StatusCode: resp.StatusCode}
json.Unmarshal(body, &authError)
return authError
}
apiError := StatusError{StatusCode: resp.StatusCode}
err := json.Unmarshal(body, &apiError)
@@ -230,9 +219,32 @@ func (c *Client) stream(ctx context.Context, method, path string, data any, fn f
scanBuf := make([]byte, 0, maxBufferSize)
scanner.Buffer(scanBuf, maxBufferSize)
for scanner.Scan() {
var errorResponse struct {
Error string `json:"error,omitempty"`
SigninURL string `json:"signin_url,omitempty"`
}
bts := scanner.Bytes()
if err := checkError(response, bts); err != nil {
return err
if err := json.Unmarshal(bts, &errorResponse); err != nil {
return fmt.Errorf("unmarshal: %w", err)
}
if response.StatusCode == http.StatusUnauthorized {
return AuthorizationError{
StatusCode: response.StatusCode,
Status: response.Status,
SigninURL: errorResponse.SigninURL,
}
} else if response.StatusCode >= http.StatusBadRequest {
return StatusError{
StatusCode: response.StatusCode,
Status: response.Status,
ErrorMessage: errorResponse.Error,
}
}
if errorResponse.Error != "" {
return errors.New(errorResponse.Error)
}
if err := fn(bts); err != nil {
@@ -429,3 +441,21 @@ func (c *Client) Version(ctx context.Context) (string, error) {
return version.Version, nil
}
// Signout will signout a client for a local ollama server.
func (c *Client) Signout(ctx context.Context) error {
return c.do(ctx, http.MethodPost, "/api/signout", nil, nil)
}
// Disconnect will disconnect an ollama instance from ollama.com.
func (c *Client) Disconnect(ctx context.Context, encodedKey string) error {
return c.do(ctx, http.MethodDelete, fmt.Sprintf("/api/user/keys/%s", encodedKey), nil, nil)
}
func (c *Client) Whoami(ctx context.Context) (*UserResponse, error) {
var resp UserResponse
if err := c.do(ctx, http.MethodPost, "/api/me", nil, &resp); err != nil {
return nil, err
}
return &resp, nil
}
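
The per-line handling added to `stream` in this file checks the HTTP status before it looks at the `error` field of the decoded line. The following is a minimal, self-contained sketch of that precedence, not the package's code: `classifyLine`, `authError`, and `statusError` are hypothetical stand-ins for the logic and for `AuthorizationError` and `StatusError`, with fields trimmed for brevity.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// authError is a simplified, hypothetical stand-in for AuthorizationError.
type authError struct {
	StatusCode int
	SigninURL  string
}

func (e authError) Error() string { return fmt.Sprintf("%d unauthorized", e.StatusCode) }

// statusError is a simplified, hypothetical stand-in for StatusError.
type statusError struct {
	StatusCode   int
	ErrorMessage string
}

func (e statusError) Error() string { return fmt.Sprintf("%d: %s", e.StatusCode, e.ErrorMessage) }

// classifyLine is a hypothetical helper that applies the same ordering as the
// diff: decode the line, then 401, then any other status >= 400, and only
// then an "error" field carried inside an otherwise successful response.
func classifyLine(statusCode int, line []byte) error {
	var payload struct {
		Error     string `json:"error,omitempty"`
		SigninURL string `json:"signin_url,omitempty"`
	}
	if err := json.Unmarshal(line, &payload); err != nil {
		return fmt.Errorf("unmarshal: %w", err)
	}
	switch {
	case statusCode == http.StatusUnauthorized:
		return authError{StatusCode: statusCode, SigninURL: payload.SigninURL}
	case statusCode >= http.StatusBadRequest:
		return statusError{StatusCode: statusCode, ErrorMessage: payload.Error}
	case payload.Error != "":
		return errors.New(payload.Error)
	}
	return nil
}

func main() {
	// A 500 with an error body yields a status error, not the bare message.
	fmt.Println(classifyLine(http.StatusInternalServerError, []byte(`{"error":"custom error message"}`)))
	// A 200 line that carries an error field is still reported as an error.
	fmt.Println(classifyLine(http.StatusOK, []byte(`{"error":"mid-stream error"}`)))
}
```

Because the status check comes first, a caller can use `errors.As` to distinguish an HTTP failure from an error reported mid-stream on a 200 response.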

@@ -89,6 +89,16 @@ func TestClientStream(t *testing.T) {
},
wantErr: "mid-stream error",
},
{
name: "http status error takes precedence over general error",
responses: []any{
testError{
message: "custom error message",
statusCode: http.StatusInternalServerError,
},
},
wantErr: "500",
},
{
name: "successful stream completion",
responses: []any{

@@ -11,6 +11,8 @@ import (
"strings"
"time"
"github.com/google/uuid"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/types/model"
)
@@ -36,6 +38,19 @@ func (e StatusError) Error() string {
}
}
type AuthorizationError struct {
StatusCode int
Status string
SigninURL string `json:"signin_url"`
}
func (e AuthorizationError) Error() string {
if e.Status != "" {
return e.Status
}
return "something went wrong, please see the ollama server logs for details"
}
// ImageData represents the raw binary data of an image file.
type ImageData []byte
@@ -85,10 +100,23 @@ type GenerateRequest struct {
Options map[string]any `json:"options"`
// Think controls whether thinking/reasoning models will think before
// responding. Needs to be a pointer so we can distinguish between false
// responding. Can be a boolean (true/false) or a string ("high", "medium", "low")
// for supported models. Needs to be a pointer so we can distinguish between false
// (request that thinking _not_ be used) and unset (use the old behavior
// before this option was introduced)
Think *bool `json:"think,omitempty"`
Think *ThinkValue `json:"think,omitempty"`
// Truncate is a boolean that, when set to true, truncates the chat history messages
// if the rendered prompt exceeds the context length limit.
Truncate *bool `json:"truncate,omitempty"`
// Shift is a boolean that, when set to true, shifts the chat history
// when hitting the context length limit instead of erroring.
Shift *bool `json:"shift,omitempty"`
// DebugRenderOnly is a debug option that, when set to true, returns the rendered
// template instead of calling the model.
DebugRenderOnly bool `json:"_debug_render_only,omitempty"`
}
// ChatRequest describes a request sent by [Client.Chat].
@@ -116,8 +144,21 @@ type ChatRequest struct {
Options map[string]any `json:"options"`
// Think controls whether thinking/reasoning models will think before
// responding
Think *bool `json:"think,omitempty"`
// responding. Can be a boolean (true/false) or a string ("high", "medium", "low")
// for supported models.
Think *ThinkValue `json:"think,omitempty"`
// Truncate is a boolean that, when set to true, truncates the chat history messages
// if the rendered prompt exceeds the context length limit.
Truncate *bool `json:"truncate,omitempty"`
// Shift is a boolean that, when set to true, shifts the chat history
// when hitting the context length limit instead of erroring.
Shift *bool `json:"shift,omitempty"`
// DebugRenderOnly is a debug option that, when set to true, returns the rendered
// template instead of calling the model.
DebugRenderOnly bool `json:"_debug_render_only,omitempty"`
}
type Tools []Tool
@@ -163,7 +204,7 @@ type ToolCall struct {
}
type ToolCallFunction struct {
Index int `json:"index,omitempty"`
Index int `json:"index"`
Name string `json:"name"`
Arguments ToolCallFunctionArguments `json:"arguments"`
}
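Note on the `Index` tag change above: dropping `omitempty` means a zero index is still serialized. The following is a minimal sketch of that difference, not the project's code; the struct names `withOmitEmpty` and `alwaysIndex` and the `marshal` helper are hypothetical and exist only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withOmitEmpty mirrors the old tag: a zero Index is dropped from the output.
// (Hypothetical type, for illustration only.)
type withOmitEmpty struct {
	Index int    `json:"index,omitempty"`
	Name  string `json:"name"`
}

// alwaysIndex mirrors the new tag: Index is emitted even when it is zero.
// (Hypothetical type, for illustration only.)
type alwaysIndex struct {
	Index int    `json:"index"`
	Name  string `json:"name"`
}

// marshal returns the JSON encoding of v, or an empty string on failure.
func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(withOmitEmpty{Name: "echo"})) // {"name":"echo"}
	fmt.Println(marshal(alwaysIndex{Name: "echo"}))   // {"index":0,"name":"echo"}
}
```

With `omitempty`, a client cannot tell "first tool call" (index 0) apart from "no index sent", which is presumably why the tag was changed.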
@@ -223,21 +264,76 @@ func (pt PropertyType) String() string {
return fmt.Sprintf("%v", []string(pt))
}
type ToolProperty struct {
AnyOf []ToolProperty `json:"anyOf,omitempty"`
Type PropertyType `json:"type,omitempty"`
Items any `json:"items,omitempty"`
Description string `json:"description,omitempty"`
Enum []any `json:"enum,omitempty"`
}
// ToTypeScriptType converts a ToolProperty to a TypeScript type string
func (tp ToolProperty) ToTypeScriptType() string {
if len(tp.AnyOf) > 0 {
var types []string
for _, anyOf := range tp.AnyOf {
types = append(types, anyOf.ToTypeScriptType())
}
return strings.Join(types, " | ")
}
if len(tp.Type) == 0 {
return "any"
}
if len(tp.Type) == 1 {
return mapToTypeScriptType(tp.Type[0])
}
var types []string
for _, t := range tp.Type {
types = append(types, mapToTypeScriptType(t))
}
return strings.Join(types, " | ")
}
// mapToTypeScriptType maps JSON Schema types to TypeScript types
func mapToTypeScriptType(jsonType string) string {
switch jsonType {
case "string":
return "string"
case "number", "integer":
return "number"
case "boolean":
return "boolean"
case "array":
return "any[]"
case "object":
return "Record<string, any>"
case "null":
return "null"
default:
return "any"
}
}
type ToolFunctionParameters struct {
Type string `json:"type"`
Defs any `json:"$defs,omitempty"`
Items any `json:"items,omitempty"`
Required []string `json:"required"`
Properties map[string]ToolProperty `json:"properties"`
}
func (t *ToolFunctionParameters) String() string {
bts, _ := json.Marshal(t)
return string(bts)
}
type ToolFunction struct {
Name string `json:"name"`
Description string `json:"description"`
Parameters struct {
Type string `json:"type"`
Defs any `json:"$defs,omitempty"`
Items any `json:"items,omitempty"`
Required []string `json:"required"`
Properties map[string]struct {
Type PropertyType `json:"type"`
Items any `json:"items,omitempty"`
Description string `json:"description"`
Enum []any `json:"enum,omitempty"`
} `json:"properties"`
} `json:"parameters"`
Name string `json:"name"`
Description string `json:"description,omitempty"`
Parameters ToolFunctionParameters `json:"parameters"`
}
func (t *ToolFunction) String() string {
@@ -248,16 +344,38 @@ func (t *ToolFunction) String() string {
// ChatResponse is the response returned by [Client.Chat]. Its fields are
// similar to [GenerateResponse].
type ChatResponse struct {
Model string `json:"model"`
CreatedAt time.Time `json:"created_at"`
Message Message `json:"message"`
DoneReason string `json:"done_reason,omitempty"`
// Model is the model name that generated the response.
Model string `json:"model"`
// RemoteModel is the name of the upstream model that generated the response.
RemoteModel string `json:"remote_model,omitempty"`
// RemoteHost is the URL of the upstream Ollama host that generated the response.
RemoteHost string `json:"remote_host,omitempty"`
// CreatedAt is the timestamp of the response.
CreatedAt time.Time `json:"created_at"`
// Message contains the message or part of a message from the model.
Message Message `json:"message"`
// Done specifies if the response is complete.
Done bool `json:"done"`
// DoneReason is the reason the model stopped generating text.
DoneReason string `json:"done_reason,omitempty"`
DebugInfo *DebugInfo `json:"_debug_info,omitempty"`
Metrics
}
// DebugInfo contains debug information for template rendering
type DebugInfo struct {
RenderedTemplate string `json:"rendered_template"`
ImageCount int `json:"image_count,omitempty"`
}
type Metrics struct {
TotalDuration time.Duration `json:"total_duration,omitempty"`
LoadDuration time.Duration `json:"load_duration,omitempty"`
@@ -310,8 +428,12 @@ type EmbedRequest struct {
// this request.
KeepAlive *Duration `json:"keep_alive,omitempty"`
// Truncate truncates the input to fit the model's max sequence length.
Truncate *bool `json:"truncate,omitempty"`
// Dimensions truncates the output embedding to the specified dimension.
Dimensions int `json:"dimensions,omitempty"`
// Options lists model-specific options.
Options map[string]any `json:"options"`
}
@@ -349,18 +471,47 @@ type EmbeddingResponse struct {
// CreateRequest is the request passed to [Client.Create].
type CreateRequest struct {
Model string `json:"model"`
Stream *bool `json:"stream,omitempty"`
// Model is the model name to create.
Model string `json:"model"`
// Stream specifies whether the response is streaming; it is true by default.
Stream *bool `json:"stream,omitempty"`
// Quantize is the quantization format for the model; leave blank to not change the quantization level.
Quantize string `json:"quantize,omitempty"`
From string `json:"from,omitempty"`
Files map[string]string `json:"files,omitempty"`
Adapters map[string]string `json:"adapters,omitempty"`
Template string `json:"template,omitempty"`
License any `json:"license,omitempty"`
System string `json:"system,omitempty"`
Parameters map[string]any `json:"parameters,omitempty"`
Messages []Message `json:"messages,omitempty"`
// From is the name of the model or file to use as the source.
From string `json:"from,omitempty"`
// RemoteHost is the URL of the upstream ollama API for the model (if any).
RemoteHost string `json:"remote_host,omitempty"`
// Files is a map of files to include when creating the model.
Files map[string]string `json:"files,omitempty"`
// Adapters is a map of LoRA adapters to include when creating the model.
Adapters map[string]string `json:"adapters,omitempty"`
// Template is the template used when constructing a request to the model.
Template string `json:"template,omitempty"`
// License is a string or list of strings for licenses.
License any `json:"license,omitempty"`
// System is the system prompt for the model.
System string `json:"system,omitempty"`
// Parameters is a map of hyper-parameters which are applied to the model.
Parameters map[string]any `json:"parameters,omitempty"`
// Messages is a list of messages added to the model before chat and generation requests.
Messages []Message `json:"messages,omitempty"`
Renderer string `json:"renderer,omitempty"`
Parser string `json:"parser,omitempty"`
// Info is a map of additional information for the model
Info map[string]any `json:"info,omitempty"`
// Deprecated: set the model name with Model instead
Name string `json:"name"`
@@ -398,8 +549,12 @@ type ShowResponse struct {
Parameters string `json:"parameters,omitempty"`
Template string `json:"template,omitempty"`
System string `json:"system,omitempty"`
Renderer string `json:"renderer,omitempty"`
Parser string `json:"parser,omitempty"`
Details ModelDetails `json:"details,omitempty"`
Messages []Message `json:"messages,omitempty"`
RemoteModel string `json:"remote_model,omitempty"`
RemoteHost string `json:"remote_host,omitempty"`
ModelInfo map[string]any `json:"model_info,omitempty"`
ProjectorInfo map[string]any `json:"projector_info,omitempty"`
Tensors []Tensor `json:"tensors,omitempty"`
@@ -458,12 +613,14 @@ type ProcessResponse struct {
// ListModelResponse is a single model description in [ListResponse].
type ListModelResponse struct {
Name string `json:"name"`
Model string `json:"model"`
ModifiedAt time.Time `json:"modified_at"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details ModelDetails `json:"details,omitempty"`
Name string `json:"name"`
Model string `json:"model"`
RemoteModel string `json:"remote_model,omitempty"`
RemoteHost string `json:"remote_host,omitempty"`
ModifiedAt time.Time `json:"modified_at"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details ModelDetails `json:"details,omitempty"`
}
// ProcessModelResponse is a single model description in [ProcessResponse].
@@ -487,6 +644,12 @@ type GenerateResponse struct {
// Model is the model name that generated the response.
Model string `json:"model"`
// RemoteModel is the name of the upstream model that generated the response.
RemoteModel string `json:"remote_model,omitempty"`
// RemoteHost is the URL of the upstream Ollama host that generated the response.
RemoteHost string `json:"remote_host,omitempty"`
// CreatedAt is the timestamp of the response.
CreatedAt time.Time `json:"created_at"`
@@ -508,6 +671,10 @@ type GenerateResponse struct {
Context []int `json:"context,omitempty"`
Metrics
ToolCalls []ToolCall `json:"tool_calls,omitempty"`
DebugInfo *DebugInfo `json:"_debug_info,omitempty"`
}
// ModelDetails provides details about a model.
@@ -520,6 +687,18 @@ type ModelDetails struct {
QuantizationLevel string `json:"quantization_level"`
}
// UserResponse provides information about a user.
type UserResponse struct {
ID uuid.UUID `json:"id"`
Email string `json:"email"`
Name string `json:"name"`
Bio string `json:"bio,omitempty"`
AvatarURL string `json:"avatarurl,omitempty"`
FirstName string `json:"firstname,omitempty"`
LastName string `json:"lastname,omitempty"`
Plan string `json:"plan,omitempty"`
}
// Tensor describes the metadata for a given tensor.
type Tensor struct {
Name string `json:"name"`
@@ -677,6 +856,113 @@ func DefaultOptions() Options {
}
}
// ThinkValue represents a value that can be a boolean or a string ("high", "medium", "low")
type ThinkValue struct {
// Value can be a bool or string
Value interface{}
}
// IsValid checks if the ThinkValue is valid
func (t *ThinkValue) IsValid() bool {
if t == nil || t.Value == nil {
return true // nil is valid (means not set)
}
switch v := t.Value.(type) {
case bool:
return true
case string:
return v == "high" || v == "medium" || v == "low"
default:
return false
}
}
// IsBool returns true if the value is a boolean
func (t *ThinkValue) IsBool() bool {
if t == nil || t.Value == nil {
return false
}
_, ok := t.Value.(bool)
return ok
}
// IsString returns true if the value is a string
func (t *ThinkValue) IsString() bool {
if t == nil || t.Value == nil {
return false
}
_, ok := t.Value.(string)
return ok
}
// Bool returns the value as a bool (true if enabled in any way)
func (t *ThinkValue) Bool() bool {
if t == nil || t.Value == nil {
return false
}
switch v := t.Value.(type) {
case bool:
return v
case string:
// Any string value ("high", "medium", "low") means thinking is enabled
return v == "high" || v == "medium" || v == "low"
default:
return false
}
}
// String returns the value as a string
func (t *ThinkValue) String() string {
if t == nil || t.Value == nil {
return ""
}
switch v := t.Value.(type) {
case string:
return v
case bool:
if v {
return "medium" // Default level when just true
}
return ""
default:
return ""
}
}
// UnmarshalJSON implements json.Unmarshaler
func (t *ThinkValue) UnmarshalJSON(data []byte) error {
// Try to unmarshal as bool first
var b bool
if err := json.Unmarshal(data, &b); err == nil {
t.Value = b
return nil
}
// Try to unmarshal as string
var s string
if err := json.Unmarshal(data, &s); err == nil {
// Validate string values
if s != "high" && s != "medium" && s != "low" {
return fmt.Errorf("invalid think value: %q (must be \"high\", \"medium\", \"low\", true, or false)", s)
}
t.Value = s
return nil
}
return fmt.Errorf("think must be a boolean or string (\"high\", \"medium\", \"low\", true, or false)")
}
// MarshalJSON implements json.Marshaler
func (t *ThinkValue) MarshalJSON() ([]byte, error) {
if t == nil || t.Value == nil {
return []byte("null"), nil
}
return json.Marshal(t.Value)
}
type Duration struct {
time.Duration
}
@@ -701,7 +987,7 @@ func (d *Duration) UnmarshalJSON(b []byte) (err error) {
if t < 0 {
d.Duration = time.Duration(math.MaxInt64)
} else {
d.Duration = time.Duration(int(t) * int(time.Second))
d.Duration = time.Duration(t * float64(time.Second))
}
case string:
d.Duration, err = time.ParseDuration(t)

View File

@@ -17,6 +17,11 @@ func TestKeepAliveParsingFromJSON(t *testing.T) {
req string
exp *Duration
}{
{
name: "Unset",
req: `{ }`,
exp: nil,
},
{
name: "Positive Integer",
req: `{ "keep_alive": 42 }`,
@@ -25,7 +30,7 @@ func TestKeepAliveParsingFromJSON(t *testing.T) {
{
name: "Positive Float",
req: `{ "keep_alive": 42.5 }`,
exp: &Duration{42 * time.Second},
exp: &Duration{42500 * time.Millisecond},
},
{
name: "Positive Integer String",
@@ -293,6 +298,30 @@ func TestToolFunction_UnmarshalJSON(t *testing.T) {
}
}
func TestToolCallFunction_IndexAlwaysMarshals(t *testing.T) {
fn := ToolCallFunction{
Name: "echo",
Arguments: ToolCallFunctionArguments{"message": "hi"},
}
data, err := json.Marshal(fn)
require.NoError(t, err)
raw := map[string]any{}
require.NoError(t, json.Unmarshal(data, &raw))
require.Contains(t, raw, "index")
assert.Equal(t, float64(0), raw["index"])
fn.Index = 3
data, err = json.Marshal(fn)
require.NoError(t, err)
raw = map[string]any{}
require.NoError(t, json.Unmarshal(data, &raw))
require.Contains(t, raw, "index")
assert.Equal(t, float64(3), raw["index"])
}
func TestPropertyType_UnmarshalJSON(t *testing.T) {
tests := []struct {
name string
@@ -374,24 +403,21 @@ func TestPropertyType_MarshalJSON(t *testing.T) {
}
func TestThinking_UnmarshalJSON(t *testing.T) {
trueVal := true
falseVal := false
tests := []struct {
name string
input string
expectedThinking *bool
expectedThinking *ThinkValue
expectedError bool
}{
{
name: "true",
input: `{ "think": true }`,
expectedThinking: &trueVal,
expectedThinking: &ThinkValue{Value: true},
},
{
name: "false",
input: `{ "think": false }`,
expectedThinking: &falseVal,
expectedThinking: &ThinkValue{Value: false},
},
{
name: "unset",
@@ -399,8 +425,23 @@ func TestThinking_UnmarshalJSON(t *testing.T) {
expectedThinking: nil,
},
{
name: "invalid",
input: `{ "think": "true" }`,
name: "string_high",
input: `{ "think": "high" }`,
expectedThinking: &ThinkValue{Value: "high"},
},
{
name: "string_medium",
input: `{ "think": "medium" }`,
expectedThinking: &ThinkValue{Value: "medium"},
},
{
name: "string_low",
input: `{ "think": "low" }`,
expectedThinking: &ThinkValue{Value: "low"},
},
{
name: "invalid_string",
input: `{ "think": "invalid" }`,
expectedThinking: nil,
expectedError: true,
},
@@ -414,8 +455,60 @@ func TestThinking_UnmarshalJSON(t *testing.T) {
require.Error(t, err)
} else {
require.NoError(t, err)
assert.Equal(t, test.expectedThinking, req.Think)
if test.expectedThinking == nil {
assert.Nil(t, req.Think)
} else {
require.NotNil(t, req.Think)
assert.Equal(t, test.expectedThinking.Value, req.Think.Value)
}
}
})
}
}
func TestToolFunctionParameters_String(t *testing.T) {
tests := []struct {
name string
params ToolFunctionParameters
expected string
}{
{
name: "simple object with string property",
params: ToolFunctionParameters{
Type: "object",
Required: []string{"name"},
Properties: map[string]ToolProperty{
"name": {
Type: PropertyType{"string"},
Description: "The name of the person",
},
},
},
expected: `{"type":"object","required":["name"],"properties":{"name":{"type":"string","description":"The name of the person"}}}`,
},
{
name: "marshal failure returns empty string",
params: ToolFunctionParameters{
Type: "object",
Defs: func() any {
// Create a cycle that will cause json.Marshal to fail
type selfRef struct {
Self *selfRef
}
s := &selfRef{}
s.Self = s
return s
}(),
Properties: map[string]ToolProperty{},
},
expected: "",
},
}
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
result := test.params.String()
assert.Equal(t, test.expected, result)
})
}
}
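
Note on the `think` tests above: they exercise a field that accepts either a JSON boolean or one of three strings. Below is a self-contained sketch of that decoding pattern, assuming a simplified stand-in type. The names `thinkValue`, `request`, and `decode` are hypothetical and are not the `api` package's identifiers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// thinkValue is a hypothetical stand-in for a bool-or-string JSON value.
type thinkValue struct {
	Value any
}

// UnmarshalJSON tries a boolean first, then a string restricted to
// "high", "medium", or "low". Anything else is rejected.
func (t *thinkValue) UnmarshalJSON(data []byte) error {
	var b bool
	if err := json.Unmarshal(data, &b); err == nil {
		t.Value = b
		return nil
	}
	var s string
	if err := json.Unmarshal(data, &s); err == nil {
		switch s {
		case "high", "medium", "low":
			t.Value = s
			return nil
		}
		return fmt.Errorf("invalid think value: %q", s)
	}
	return fmt.Errorf("think must be a boolean or a string")
}

// request uses a pointer so that an absent field stays nil (unset).
type request struct {
	Think *thinkValue `json:"think,omitempty"`
}

// decode parses a request body and returns the think field, if any.
func decode(input string) (*thinkValue, error) {
	var req request
	if err := json.Unmarshal([]byte(input), &req); err != nil {
		return nil, err
	}
	return req.Think, nil
}

func main() {
	inputs := []string{`{"think": true}`, `{"think": "high"}`, `{}`, `{"think": "invalid"}`}
	for _, in := range inputs {
		tv, err := decode(in)
		switch {
		case err != nil:
			fmt.Printf("%s -> error: %v\n", in, err)
		case tv == nil:
			fmt.Printf("%s -> unset\n", in)
		default:
			fmt.Printf("%s -> %v\n", in, tv.Value)
		}
	}
}
```

The pointer field is what lets a caller distinguish three states: explicitly disabled (`false`), explicitly enabled (`true` or a level), and not specified at all (`nil`).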

View File

@@ -0,0 +1,142 @@
package api
import (
"testing"
)
func TestToolParameterToTypeScriptType(t *testing.T) {
tests := []struct {
name string
param ToolProperty
expected string
}{
{
name: "single string type",
param: ToolProperty{
Type: PropertyType{"string"},
},
expected: "string",
},
{
name: "single number type",
param: ToolProperty{
Type: PropertyType{"number"},
},
expected: "number",
},
{
name: "integer maps to number",
param: ToolProperty{
Type: PropertyType{"integer"},
},
expected: "number",
},
{
name: "boolean type",
param: ToolProperty{
Type: PropertyType{"boolean"},
},
expected: "boolean",
},
{
name: "array type",
param: ToolProperty{
Type: PropertyType{"array"},
},
expected: "any[]",
},
{
name: "object type",
param: ToolProperty{
Type: PropertyType{"object"},
},
expected: "Record<string, any>",
},
{
name: "null type",
param: ToolProperty{
Type: PropertyType{"null"},
},
expected: "null",
},
{
name: "multiple types as union",
param: ToolProperty{
Type: PropertyType{"string", "number"},
},
expected: "string | number",
},
{
name: "string or null union",
param: ToolProperty{
Type: PropertyType{"string", "null"},
},
expected: "string | null",
},
{
name: "anyOf with single types",
param: ToolProperty{
AnyOf: []ToolProperty{
{Type: PropertyType{"string"}},
{Type: PropertyType{"number"}},
},
},
expected: "string | number",
},
{
name: "anyOf with multiple types in each branch",
param: ToolProperty{
AnyOf: []ToolProperty{
{Type: PropertyType{"string", "null"}},
{Type: PropertyType{"number"}},
},
},
expected: "string | null | number",
},
{
name: "nested anyOf",
param: ToolProperty{
AnyOf: []ToolProperty{
{Type: PropertyType{"boolean"}},
{
AnyOf: []ToolProperty{
{Type: PropertyType{"string"}},
{Type: PropertyType{"number"}},
},
},
},
},
expected: "boolean | string | number",
},
{
name: "empty type returns any",
param: ToolProperty{
Type: PropertyType{},
},
expected: "any",
},
{
name: "unknown type maps to any",
param: ToolProperty{
Type: PropertyType{"unknown_type"},
},
expected: "any",
},
{
name: "multiple types including array",
param: ToolProperty{
Type: PropertyType{"string", "array", "null"},
},
expected: "string | any[] | null",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result := tt.param.ToTypeScriptType()
if result != tt.expected {
t.Errorf("ToTypeScriptType() = %q, want %q", result, tt.expected)
}
})
}
}
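
Note on the test table above: nested `anyOf` branches are flattened into a single TypeScript union. The sketch below reproduces that behavior with a simplified struct, under the assumption that only `AnyOf` and `Type` matter for the conversion. The names `property`, `tsType`, and `mapType` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// property is a hypothetical, reduced form of a tool property:
// either a list of alternatives or a list of JSON Schema type names.
type property struct {
	AnyOf []property
	Type  []string
}

// mapType maps a JSON Schema type name to a TypeScript type.
func mapType(t string) string {
	switch t {
	case "string":
		return "string"
	case "number", "integer":
		return "number"
	case "boolean":
		return "boolean"
	case "array":
		return "any[]"
	case "object":
		return "Record<string, any>"
	case "null":
		return "null"
	default:
		return "any"
	}
}

// tsType renders a property as a TypeScript type string. anyOf branches
// are rendered recursively and joined, so nesting collapses into one union.
func tsType(p property) string {
	if len(p.AnyOf) > 0 {
		parts := make([]string, 0, len(p.AnyOf))
		for _, branch := range p.AnyOf {
			parts = append(parts, tsType(branch))
		}
		return strings.Join(parts, " | ")
	}
	if len(p.Type) == 0 {
		return "any"
	}
	parts := make([]string, 0, len(p.Type))
	for _, t := range p.Type {
		parts = append(parts, mapType(t))
	}
	return strings.Join(parts, " | ")
}

func main() {
	nested := property{AnyOf: []property{
		{Type: []string{"boolean"}},
		{AnyOf: []property{
			{Type: []string{"string"}},
			{Type: []string{"integer"}},
		}},
	}}
	fmt.Println(tsType(nested)) // boolean | string | number
}
```

Because the join is a plain string concatenation, duplicate members are not removed: two branches that both map to `number` would yield `number | number`.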

View File

@@ -18,56 +18,13 @@ import (
const defaultPrivateKey = "id_ed25519"
var ErrInvalidToken = errors.New("invalid token")
func keyPath() (string, error) {
func GetPublicKey() (string, error) {
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
return filepath.Join(home, ".ollama", defaultPrivateKey), nil
}
func parseToken(token string) (key, sig []byte, _ error) {
keyData, sigData, ok := strings.Cut(token, ":")
if !ok {
return nil, nil, fmt.Errorf("identity: parseToken: %w", ErrInvalidToken)
}
sig, err := base64.StdEncoding.DecodeString(sigData)
if err != nil {
return nil, nil, fmt.Errorf("identity: parseToken: base64 decoding signature: %w", err)
}
return []byte(keyData), sig, nil
}
func Authenticate(token, checkData string) (ssh.PublicKey, error) {
keyShort, sigBytes, err := parseToken(token)
if err != nil {
return nil, err
}
keyLong := append([]byte("ssh-ed25519 "), keyShort...)
pub, _, _, _, err := ssh.ParseAuthorizedKey(keyLong)
if err != nil {
return nil, err
}
if err := pub.Verify([]byte(checkData), &ssh.Signature{
Format: pub.Type(),
Blob: sigBytes,
}); err != nil {
return nil, err
}
return pub, nil
}
func GetPublicKey() (string, error) {
keyPath, err := keyPath()
if err != nil {
return "", err
}
keyPath := filepath.Join(home, ".ollama", defaultPrivateKey)
privateKeyFile, err := os.ReadFile(keyPath)
if err != nil {
slog.Info(fmt.Sprintf("Failed to load private key: %v", err))
@@ -94,11 +51,12 @@ func NewNonce(r io.Reader, length int) (string, error) {
}
func Sign(ctx context.Context, bts []byte) (string, error) {
keyPath, err := keyPath()
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
keyPath := filepath.Join(home, ".ollama", defaultPrivateKey)
privateKeyFile, err := os.ReadFile(keyPath)
if err != nil {
slog.Info(fmt.Sprintf("Failed to load private key: %v", err))

View File

@@ -1,254 +0,0 @@
package auth
import (
"bufio"
"encoding/base64"
"fmt"
"io"
"log/slog"
"os"
"path/filepath"
"regexp"
"strings"
"sync"
"time"
"golang.org/x/crypto/ssh"
)
type KeyEntry struct {
Name string
PublicKey string
Endpoints []string
}
type KeyPermission struct {
Name string
Endpoints []string
}
type APIPermissions struct {
permissions map[string]*KeyPermission
lastModified time.Time
mutex sync.RWMutex
}
var ws = regexp.MustCompile(`\s+`)
func authkeyPath() (string, error) {
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
return filepath.Join(home, ".ollama", "authorized_keys"), nil
}
func NewAPIPermissions() *APIPermissions {
return &APIPermissions{
permissions: make(map[string]*KeyPermission),
mutex: sync.RWMutex{},
}
}
func (ap *APIPermissions) ReloadIfNeeded() error {
ap.mutex.Lock()
defer ap.mutex.Unlock()
filename, err := authkeyPath()
if err != nil {
return err
}
fileInfo, err := os.Stat(filename)
if err != nil {
return fmt.Errorf("failed to stat file: %v", err)
}
if !fileInfo.ModTime().After(ap.lastModified) {
return nil
}
file, err := os.Open(filename)
if err != nil {
return fmt.Errorf("failed to open file: %v", err)
}
defer file.Close()
ap.lastModified = fileInfo.ModTime()
return ap.parse(file)
}
func (ap *APIPermissions) parse(r io.Reader) error {
ap.permissions = make(map[string]*KeyPermission)
scanner := bufio.NewScanner(r)
var cnt int
for scanner.Scan() {
cnt += 1
line := strings.TrimSpace(scanner.Text())
if line == "" || strings.HasPrefix(line, "#") {
continue
}
line = ws.ReplaceAllString(line, " ")
entry, err := ap.parseLine(line)
if err != nil {
slog.Warn(fmt.Sprintf("authorized_keys line %d: skipping invalid line: %v\n", cnt, err))
continue
}
var pubKeyStr string
if entry.PublicKey == "*" {
pubKeyStr = "*"
} else {
pubKey, err := ap.validateAndDecodeKey(entry)
if err != nil {
slog.Warn(fmt.Sprintf("authorized_keys line %d: invalid key for %s: %v\n", cnt, entry.Name, err))
continue
}
pubKeyStr = pubKey
}
if perm, exists := ap.permissions[pubKeyStr]; exists {
if perm.Name == "default" {
perm.Name = entry.Name
}
if len(perm.Endpoints) == 1 && perm.Endpoints[0] == "*" {
// skip redundant entries
continue
} else if len(entry.Endpoints) == 1 && entry.Endpoints[0] == "*" {
// overwrite redundant entries
perm.Endpoints = entry.Endpoints
} else {
perm.Endpoints = append(perm.Endpoints, entry.Endpoints...)
}
} else {
ap.permissions[pubKeyStr] = &KeyPermission{
Name: entry.Name,
Endpoints: entry.Endpoints,
}
}
}
return scanner.Err()
}
func (ap *APIPermissions) parseLine(line string) (*KeyEntry, error) {
parts := strings.SplitN(line, " ", 4)
if len(parts) < 2 {
return nil, fmt.Errorf("key type and public key not found")
}
kind, b64Key := parts[0], parts[1]
name := "default"
eps := "*"
if len(parts) >= 3 && parts[2] != "" {
if parts[2] != "*" {
name = parts[2]
}
}
if len(parts) == 4 && parts[3] != "" {
eps = parts[3]
}
if kind != "ssh-ed25519" && kind != "*" {
return nil, fmt.Errorf("unsupported key type %s", kind)
}
if kind == "*" && b64Key != "*" {
return nil, fmt.Errorf("unsupported key type")
}
var endpoints []string
if eps == "*" {
endpoints = []string{"*"}
} else {
for _, e := range strings.Split(eps, ",") {
e = strings.TrimSpace(e)
if e == "" {
return nil, fmt.Errorf("empty endpoint in list")
} else if e == "*" {
endpoints = []string{"*"}
break
}
endpoints = append(endpoints, e)
}
}
return &KeyEntry{
PublicKey: b64Key,
Name: name,
Endpoints: endpoints,
}, nil
}
func (ap *APIPermissions) validateAndDecodeKey(entry *KeyEntry) (string, error) {
keyBlob, err := base64.StdEncoding.DecodeString(entry.PublicKey)
if err != nil {
return "", fmt.Errorf("base64 decode: %w", err)
}
pub, err := ssh.ParsePublicKey(keyBlob)
if err != nil {
return "", fmt.Errorf("parse key: %w", err)
}
if pub.Type() != ssh.KeyAlgoED25519 {
return "", fmt.Errorf("key is not Ed25519")
}
return entry.PublicKey, nil
}
func (ap *APIPermissions) Authorize(pubKey ssh.PublicKey, endpoint string) (bool, string, error) {
if err := ap.ReloadIfNeeded(); err != nil {
return false, "unknown", err
}
ap.mutex.RLock()
defer ap.mutex.RUnlock()
if wildcardPerm, exists := ap.permissions["*"]; exists {
if len(wildcardPerm.Endpoints) == 1 && wildcardPerm.Endpoints[0] == "*" {
return true, wildcardPerm.Name, nil
}
for _, allowedEndpoint := range wildcardPerm.Endpoints {
if allowedEndpoint == endpoint {
return true, wildcardPerm.Name, nil
}
}
}
keyString := string(ssh.MarshalAuthorizedKey(pubKey))
parts := strings.SplitN(keyString, " ", 2)
var base64Key string
if len(parts) > 1 {
base64Key = parts[1]
} else {
base64Key = parts[0]
}
base64Key = strings.TrimSpace(base64Key)
perm, exists := ap.permissions[base64Key]
if !exists {
return false, "unknown", nil
}
if len(perm.Endpoints) == 1 && perm.Endpoints[0] == "*" {
return true, perm.Name, nil
}
for _, allowedEndpoint := range perm.Endpoints {
if allowedEndpoint == endpoint {
return true, perm.Name, nil
}
}
return false, "unknown", nil
}

View File

@@ -1,133 +0,0 @@
package auth
import (
"bytes"
"reflect"
"testing"
)
const validB64 = "AAAAC3NzaC1lZDI1NTE5AAAAICy1v/Sn0kGhu1LXzCsnx3wlk5ESdncS66JWo13yeJod"
func TestParse(t *testing.T) {
tests := []struct {
name string
file string
want map[string]*KeyPermission
}{
{
name: "two fields only defaults",
file: "ssh-ed25519 " + validB64 + "\n",
want: map[string]*KeyPermission{
validB64: {
Name: "default",
Endpoints: []string{"*"},
},
},
},
{
name: "extra whitespace collapsed and default endpoints",
file: "ssh-ed25519 " + validB64 + " alice\n",
want: map[string]*KeyPermission{
validB64: {
Name: "alice",
Endpoints: []string{"*"},
},
},
},
{
name: "four fields full",
file: "ssh-ed25519 " + validB64 + " bob /api/foo,/api/bar\n",
want: map[string]*KeyPermission{
validB64: {
Name: "bob",
Endpoints: []string{"/api/foo", "/api/bar"},
},
},
},
{
name: "comment lines ignored and multiple entries",
file: "# header\n\nssh-ed25519 " + validB64 + " user1\nssh-ed25519 " + validB64 + " user2 /api/x\n",
want: map[string]*KeyPermission{
validB64: {
Name: "user1",
Endpoints: []string{"*"},
},
},
},
{
name: "three entries variety",
file: "ssh-ed25519 " + validB64 + "\nssh-ed25519 " + validB64 + " alice /api/a,/api/b\nssh-ed25519 " + validB64 + " bob /api/c\n",
want: map[string]*KeyPermission{
validB64: {
Name: "alice",
Endpoints: []string{"*"},
},
},
},
{
name: "two entries w/ wildcard",
file: "ssh-ed25519 " + validB64 + " alice /api/a\n* * * /api/b\n",
want: map[string]*KeyPermission{
validB64: {
Name: "alice",
Endpoints: []string{"/api/a"},
},
"*": {
Name: "default",
Endpoints: []string{"/api/b"},
},
},
},
{
name: "tags for everyone",
file: "* * * /api/tags",
want: map[string]*KeyPermission{
"*": {
Name: "default",
Endpoints: []string{"/api/tags"},
},
},
},
{
name: "default name",
file: "* * somename",
want: map[string]*KeyPermission{
"*": {
Name: "somename",
Endpoints: []string{"*"},
},
},
},
{
name: "unsupported key type",
file: "ssh-rsa AAAAB3Nza...\n",
want: map[string]*KeyPermission{},
},
{
name: "bad base64",
file: "ssh-ed25519 invalid@@@\n",
want: map[string]*KeyPermission{},
},
{
name: "just an asterisk",
file: "*\n",
want: map[string]*KeyPermission{},
},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
perms := NewAPIPermissions()
err := perms.parse(bytes.NewBufferString(tc.file))
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if len(perms.permissions) != len(tc.want) {
t.Fatalf("got %d entries, want %d", len(perms.permissions), len(tc.want))
}
if !reflect.DeepEqual(perms.permissions, tc.want) {
t.Errorf("got %+v, want %+v", perms.permissions, tc.want)
}
})
}
}

View File

@@ -47,6 +47,8 @@ import (
"github.com/ollama/ollama/version"
)
const ConnectInstructions = "To sign in, navigate to:\n %s\n\n"
// ensureThinkingSupport emits a warning if the model does not advertise thinking support
func ensureThinkingSupport(ctx context.Context, client *api.Client, name string) {
if name == "" {
@@ -56,10 +58,8 @@ func ensureThinkingSupport(ctx context.Context, client *api.Client, name string)
if err != nil {
return
}
for _, cap := range resp.Capabilities {
if cap == model.CapabilityThinking {
return
}
if slices.Contains(resp.Capabilities, model.CapabilityThinking) {
return
}
fmt.Fprintf(os.Stderr, "warning: model %q does not support thinking output\n", name)
}
@@ -288,7 +288,17 @@ func loadOrUnloadModel(cmd *cobra.Command, opts *runOptions) error {
Think: opts.Think,
}
return client.Generate(cmd.Context(), req, func(api.GenerateResponse) error { return nil })
return client.Generate(cmd.Context(), req, func(r api.GenerateResponse) error {
if r.RemoteModel != "" && opts.ShowConnect {
p.StopAndClear()
if strings.HasPrefix(r.RemoteHost, "https://ollama.com") {
fmt.Fprintf(os.Stderr, "Connecting to '%s' on 'ollama.com' ⚡\n", r.RemoteModel)
} else {
fmt.Fprintf(os.Stderr, "Connecting to '%s' on '%s'\n", r.RemoteModel, r.RemoteHost)
}
}
return nil
})
}
func StopHandler(cmd *cobra.Command, args []string) error {
@@ -309,9 +319,10 @@ func RunHandler(cmd *cobra.Command, args []string) error {
interactive := true
opts := runOptions{
Model: args[0],
WordWrap: os.Getenv("TERM") == "xterm-256color",
Options: map[string]any{},
Model: args[0],
WordWrap: os.Getenv("TERM") == "xterm-256color",
Options: map[string]any{},
ShowConnect: true,
}
format, err := cmd.Flags().GetString("format")
@@ -322,11 +333,23 @@ func RunHandler(cmd *cobra.Command, args []string) error {
thinkFlag := cmd.Flags().Lookup("think")
if thinkFlag.Changed {
think, err := cmd.Flags().GetBool("think")
thinkStr, err := cmd.Flags().GetString("think")
if err != nil {
return err
}
opts.Think = &think
// Handle different values for --think
switch thinkStr {
case "", "true":
// --think or --think=true
opts.Think = &api.ThinkValue{Value: true}
case "false":
opts.Think = &api.ThinkValue{Value: false}
case "high", "medium", "low":
opts.Think = &api.ThinkValue{Value: thinkStr}
default:
return fmt.Errorf("invalid value for --think: %q (must be true, false, high, medium, or low)", thinkStr)
}
} else {
opts.Think = nil
}
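
Note on the `--think` handling above: the flag is now read as a string and mapped to either a boolean or a level. The sketch below isolates that mapping in a hypothetical helper, `parseThink`, and assumes (as the diff's comment suggests) that a bare `--think` arrives as an empty string. It returns a plain `any` rather than the `api.ThinkValue` wrapper.

```go
package main

import "fmt"

// parseThink maps the raw flag text to the value that would be sent in
// the request: true/false for boolean forms, or the level string itself.
// (Hypothetical helper, for illustration only.)
func parseThink(s string) (any, error) {
	switch s {
	case "", "true":
		// Bare --think or --think=true enables thinking.
		return true, nil
	case "false":
		return false, nil
	case "high", "medium", "low":
		return s, nil
	default:
		return nil, fmt.Errorf("invalid value for --think: %q (must be true, false, high, medium, or low)", s)
	}
}

func main() {
	for _, in := range []string{"", "false", "high", "maybe"} {
		v, err := parseThink(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %v\n", in, v)
	}
}
```

Validating at the CLI keeps a bad level from reaching the server, where the JSON decoder would reject it with a less specific message.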
@@ -357,6 +380,7 @@ func RunHandler(cmd *cobra.Command, args []string) error {
}
prompts = append([]string{string(in)}, prompts...)
opts.ShowConnect = false
opts.WordWrap = false
interactive = false
}
@@ -423,6 +447,15 @@ func RunHandler(cmd *cobra.Command, args []string) error {
if interactive {
if err := loadOrUnloadModel(cmd, &opts); err != nil {
var sErr api.AuthorizationError
if errors.As(err, &sErr) && sErr.StatusCode == http.StatusUnauthorized {
fmt.Printf("You need to be signed in to Ollama to run Cloud models.\n\n")
if sErr.SigninURL != "" {
fmt.Printf(ConnectInstructions, sErr.SigninURL)
}
return nil
}
return err
}
@@ -443,6 +476,59 @@ func RunHandler(cmd *cobra.Command, args []string) error {
return generate(cmd, opts)
}
func SigninHandler(cmd *cobra.Command, args []string) error {
client, err := api.ClientFromEnvironment()
if err != nil {
return err
}
user, err := client.Whoami(cmd.Context())
if err != nil {
var aErr api.AuthorizationError
if errors.As(err, &aErr) && aErr.StatusCode == http.StatusUnauthorized {
fmt.Println("You need to be signed in to Ollama to run Cloud models.")
fmt.Println()
if aErr.SigninURL != "" {
fmt.Printf(ConnectInstructions, aErr.SigninURL)
}
return nil
}
return err
}
if user != nil && user.Name != "" {
fmt.Printf("You are already signed in as user '%s'\n", user.Name)
fmt.Println()
return nil
}
return nil
}
func SignoutHandler(cmd *cobra.Command, args []string) error {
client, err := api.ClientFromEnvironment()
if err != nil {
return err
}
err = client.Signout(cmd.Context())
if err != nil {
var aErr api.AuthorizationError
if errors.As(err, &aErr) && aErr.StatusCode == http.StatusUnauthorized {
fmt.Println("You are not signed in to ollama.com")
fmt.Println()
return nil
} else {
return err
}
}
fmt.Println("You have signed out of ollama.com")
fmt.Println()
return nil
}
func PushHandler(cmd *cobra.Command, args []string) error {
client, err := api.ClientFromEnvironment()
if err != nil {
@@ -454,6 +540,25 @@ func PushHandler(cmd *cobra.Command, args []string) error {
return err
}
n := model.ParseName(args[0])
if strings.HasSuffix(n.Host, ".ollama.ai") || strings.HasSuffix(n.Host, ".ollama.com") {
_, err := client.Whoami(cmd.Context())
if err != nil {
var aErr api.AuthorizationError
if errors.As(err, &aErr) && aErr.StatusCode == http.StatusUnauthorized {
fmt.Println("You need to be signed in to push models to ollama.com.")
fmt.Println()
if aErr.SigninURL != "" {
fmt.Printf(ConnectInstructions, aErr.SigninURL)
}
return nil
}
return err
}
}
p := progress.NewProgress(os.Stderr)
defer p.Stop()
@@ -490,12 +595,12 @@ func PushHandler(cmd *cobra.Command, args []string) error {
request := api.PushRequest{Name: args[0], Insecure: insecure}
n := model.ParseName(args[0])
if err := client.Push(cmd.Context(), &request, fn); err != nil {
if spinner != nil {
spinner.Stop()
}
if strings.Contains(err.Error(), "access denied") {
errStr := strings.ToLower(err.Error())
if strings.Contains(errStr, "access denied") || strings.Contains(errStr, "unauthorized") {
return errors.New("you are not authorized to push to this namespace, create the model under a namespace you own")
}
return err
@@ -529,7 +634,14 @@ func ListHandler(cmd *cobra.Command, args []string) error {
for _, m := range models.Models {
if len(args) == 0 || strings.HasPrefix(strings.ToLower(m.Name), strings.ToLower(args[0])) {
data = append(data, []string{m.Name, m.Digest[:12], format.HumanBytes(m.Size), format.HumanTime(m.ModifiedAt, "Never")})
var size string
if m.RemoteModel != "" {
size = "-"
} else {
size = format.HumanBytes(m.Size)
}
data = append(data, []string{m.Name, m.Digest[:12], size, format.HumanTime(m.ModifiedAt, "Never")})
}
}
@@ -614,8 +726,8 @@ func DeleteHandler(cmd *cobra.Command, args []string) error {
KeepAlive: &api.Duration{Duration: 0},
}
if err := loadOrUnloadModel(cmd, opts); err != nil {
if !strings.Contains(err.Error(), "not found") {
return fmt.Errorf("unable to stop existing running model \"%s\": %s", args[0], err)
if !strings.Contains(strings.ToLower(err.Error()), "not found") {
fmt.Fprintf(os.Stderr, "Warning: unable to stop model '%s'\n", args[0])
}
}
@@ -726,12 +838,36 @@ func showInfo(resp *api.ShowResponse, verbose bool, w io.Writer) error {
}
tableRender("Model", func() (rows [][]string) {
if resp.RemoteHost != "" {
rows = append(rows, []string{"", "Remote model", resp.RemoteModel})
rows = append(rows, []string{"", "Remote URL", resp.RemoteHost})
}
if resp.ModelInfo != nil {
arch := resp.ModelInfo["general.architecture"].(string)
rows = append(rows, []string{"", "architecture", arch})
rows = append(rows, []string{"", "parameters", format.HumanNumber(uint64(resp.ModelInfo["general.parameter_count"].(float64)))})
rows = append(rows, []string{"", "context length", strconv.FormatFloat(resp.ModelInfo[fmt.Sprintf("%s.context_length", arch)].(float64), 'f', -1, 64)})
rows = append(rows, []string{"", "embedding length", strconv.FormatFloat(resp.ModelInfo[fmt.Sprintf("%s.embedding_length", arch)].(float64), 'f', -1, 64)})
var paramStr string
if resp.Details.ParameterSize != "" {
paramStr = resp.Details.ParameterSize
} else if v, ok := resp.ModelInfo["general.parameter_count"]; ok {
if f, ok := v.(float64); ok {
paramStr = format.HumanNumber(uint64(f))
}
}
rows = append(rows, []string{"", "parameters", paramStr})
if v, ok := resp.ModelInfo[fmt.Sprintf("%s.context_length", arch)]; ok {
if f, ok := v.(float64); ok {
rows = append(rows, []string{"", "context length", strconv.FormatFloat(f, 'f', -1, 64)})
}
}
if v, ok := resp.ModelInfo[fmt.Sprintf("%s.embedding_length", arch)]; ok {
if f, ok := v.(float64); ok {
rows = append(rows, []string{"", "embedding length", strconv.FormatFloat(f, 'f', -1, 64)})
}
}
} else {
rows = append(rows, []string{"", "architecture", resp.Details.Family})
rows = append(rows, []string{"", "parameters", resp.Details.ParameterSize})
@@ -977,8 +1113,54 @@ type runOptions struct {
Options map[string]any
MultiModal bool
KeepAlive *api.Duration
Think *bool
Think *api.ThinkValue
HideThinking bool
ShowConnect bool
}
func (r runOptions) Copy() runOptions {
var messages []api.Message
if r.Messages != nil {
messages = make([]api.Message, len(r.Messages))
copy(messages, r.Messages)
}
var images []api.ImageData
if r.Images != nil {
images = make([]api.ImageData, len(r.Images))
copy(images, r.Images)
}
var opts map[string]any
if r.Options != nil {
opts = make(map[string]any, len(r.Options))
for k, v := range r.Options {
opts[k] = v
}
}
var think *api.ThinkValue
if r.Think != nil {
cThink := *r.Think
think = &cThink
}
return runOptions{
Model: r.Model,
ParentModel: r.ParentModel,
Prompt: r.Prompt,
Messages: messages,
WordWrap: r.WordWrap,
Format: r.Format,
System: r.System,
Images: images,
Options: opts,
MultiModal: r.MultiModal,
KeepAlive: r.KeepAlive,
Think: think,
HideThinking: r.HideThinking,
ShowConnect: r.ShowConnect,
}
}
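The reason `Copy` allocates fresh slices, a fresh map, and a fresh `Think` pointer is that a plain struct assignment would share that backing storage with the original. A minimal sketch of the same idea on a reduced struct follows; the `options` type and its fields are hypothetical and much smaller than `runOptions`.

```go
package main

import "fmt"

// options is a hypothetical, reduced analogue of runOptions.
type options struct {
	Tags   []string
	Params map[string]any
	Level  *string
}

// Copy duplicates the slice, the map, and the pointed-to string so that
// later writes through the original do not show up in the copy. Values
// stored inside the map are copied shallowly, as in the diff.
func (o options) Copy() options {
	var tags []string
	if o.Tags != nil {
		tags = make([]string, len(o.Tags))
		copy(tags, o.Tags)
	}

	var params map[string]any
	if o.Params != nil {
		params = make(map[string]any, len(o.Params))
		for k, v := range o.Params {
			params[k] = v
		}
	}

	var level *string
	if o.Level != nil {
		l := *o.Level
		level = &l
	}

	return options{Tags: tags, Params: params, Level: level}
}

func main() {
	lvl := "high"
	orig := options{Tags: []string{"a"}, Params: map[string]any{"k": 1}, Level: &lvl}
	cp := orig.Copy()

	// Mutate the original through every shared-capable field.
	orig.Tags[0] = "changed"
	orig.Params["k"] = 2
	*orig.Level = "low"

	fmt.Println(cp.Tags[0], cp.Params["k"], *cp.Level) // prints "a 1 high"
}
```

The nil checks keep nil slices and maps nil in the copy, which is what lets a zero value round-trip unchanged.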
type displayResponseState struct {
@@ -1017,10 +1199,11 @@ func displayResponse(content string, wordWrap bool, state *displayResponseState)
}
switch ch {
case ' ':
case ' ', '\t':
state.wordBuffer = ""
case '\n':
case '\n', '\r':
state.lineLength = 0
state.wordBuffer = ""
default:
state.wordBuffer += string(ch)
}
@@ -1078,6 +1261,7 @@ func chat(cmd *cobra.Command, opts runOptions) (*api.Message, error) {
}()
var state *displayResponseState = &displayResponseState{}
var thinkingContent strings.Builder
var latest api.ChatResponse
var fullResponse strings.Builder
var thinkTagOpened bool = false
@@ -1097,14 +1281,21 @@ func chat(cmd *cobra.Command, opts runOptions) (*api.Message, error) {
if !thinkTagOpened {
fmt.Print(thinkingOutputOpeningText(false))
thinkTagOpened = true
thinkTagClosed = false
}
thinkingContent.WriteString(response.Message.Thinking)
displayResponse(response.Message.Thinking, opts.WordWrap, state)
}
content := response.Message.Content
if thinkTagOpened && !thinkTagClosed && content != "" {
if thinkTagOpened && !thinkTagClosed && (content != "" || len(response.Message.ToolCalls) > 0) {
if !strings.HasSuffix(thinkingContent.String(), "\n") {
fmt.Println()
}
fmt.Print(thinkingOutputClosingText(false))
thinkTagOpened = false
thinkTagClosed = true
state = &displayResponseState{}
}
// purposefully not putting thinking blocks in the response, which would
// only be needed if we later added tool calling to the cli (they get
@@ -1112,6 +1303,13 @@ func chat(cmd *cobra.Command, opts runOptions) (*api.Message, error) {
// about to finish some tool calls)
fullResponse.WriteString(content)
if response.Message.ToolCalls != nil {
toolCalls := response.Message.ToolCalls
if len(toolCalls) > 0 {
fmt.Print(renderToolCalls(toolCalls, false))
}
}
displayResponse(content, opts.WordWrap, state)
return nil
@@ -1196,6 +1394,7 @@ func generate(cmd *cobra.Command, opts runOptions) error {
}()
var state *displayResponseState = &displayResponseState{}
var thinkingContent strings.Builder
var thinkTagOpened bool = false
var thinkTagClosed bool = false
@@ -1213,17 +1412,31 @@ func generate(cmd *cobra.Command, opts runOptions) error {
if !thinkTagOpened {
fmt.Print(thinkingOutputOpeningText(plainText))
thinkTagOpened = true
thinkTagClosed = false
}
thinkingContent.WriteString(response.Thinking)
displayResponse(response.Thinking, opts.WordWrap, state)
}
if thinkTagOpened && !thinkTagClosed && content != "" {
if thinkTagOpened && !thinkTagClosed && (content != "" || len(response.ToolCalls) > 0) {
if !strings.HasSuffix(thinkingContent.String(), "\n") {
fmt.Println()
}
fmt.Print(thinkingOutputClosingText(plainText))
thinkTagOpened = false
thinkTagClosed = true
state = &displayResponseState{}
}
displayResponse(content, opts.WordWrap, state)
if response.ToolCalls != nil {
toolCalls := response.ToolCalls
if len(toolCalls) > 0 {
fmt.Print(renderToolCalls(toolCalls, plainText))
}
}
return nil
}
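The open/close bookkeeping for the thinking block in `chat` and `generate` can be sketched in isolation as a small state machine. This is a simplified sketch, not the CLI's code: the `chunk` type, the `render` function, and the `<think>` / `</think>` marker strings are hypothetical stand-ins (the real markers come from `thinkingOutputOpeningText` and `thinkingOutputClosingText`, whose output is not shown in this diff).

```go
package main

import (
	"fmt"
	"strings"
)

// chunk is a hypothetical, reduced view of one streamed response.
type chunk struct {
	Thinking  string
	Content   string
	ToolCalls int // number of tool calls carried by this chunk
}

// render opens the thinking block on the first thinking text and closes it
// on the first chunk that carries content or a tool call, adding a newline
// first if the thinking text did not already end with one.
func render(chunks []chunk) string {
	var out, thinking strings.Builder
	opened, closed := false, false

	for _, c := range chunks {
		if c.Thinking != "" {
			if !opened {
				out.WriteString("<think>\n") // placeholder marker
				opened = true
				closed = false
			}
			thinking.WriteString(c.Thinking)
			out.WriteString(c.Thinking)
		}

		if opened && !closed && (c.Content != "" || c.ToolCalls > 0) {
			if !strings.HasSuffix(thinking.String(), "\n") {
				out.WriteString("\n")
			}
			out.WriteString("</think>\n") // placeholder marker
			opened = false
			closed = true
		}

		out.WriteString(c.Content)
	}
	return out.String()
}

func main() {
	fmt.Printf("%q\n", render([]chunk{{Thinking: "a"}, {Content: "b"}}))
}
```

Resetting `opened` on close, as the diff does with `thinkTagOpened = false`, is what allows a later thinking chunk to open a second block instead of being printed without a marker.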
@@ -1463,7 +1676,8 @@ func NewCLI() *cobra.Command {
runCmd.Flags().Bool("insecure", false, "Use an insecure registry")
runCmd.Flags().Bool("nowordwrap", false, "Don't wrap words to the next line automatically")
runCmd.Flags().String("format", "", "Response format (e.g. json)")
runCmd.Flags().Bool("think", false, "Whether to use thinking mode for supported models")
runCmd.Flags().String("think", "", "Enable thinking mode: true/false or high/medium/low for supported models")
runCmd.Flags().Lookup("think").NoOptDefVal = "true"
runCmd.Flags().Bool("hidethinking", false, "Hide thinking output (if provided)")
stopCmd := &cobra.Command{
@@ -1502,6 +1716,22 @@ func NewCLI() *cobra.Command {
pushCmd.Flags().Bool("insecure", false, "Use an insecure registry")
signinCmd := &cobra.Command{
Use: "signin",
Short: "Sign in to ollama.com",
Args: cobra.ExactArgs(0),
PreRunE: checkServerHeartbeat,
RunE: SigninHandler,
}
signoutCmd := &cobra.Command{
Use: "signout",
Short: "Sign out from ollama.com",
Args: cobra.ExactArgs(0),
PreRunE: checkServerHeartbeat,
RunE: SignoutHandler,
}
listCmd := &cobra.Command{
Use: "list",
Aliases: []string{"ls"},
@@ -1568,6 +1798,7 @@ func NewCLI() *cobra.Command {
appendEnvDocs(cmd, []envconfig.EnvVar{
envVars["OLLAMA_DEBUG"],
envVars["OLLAMA_HOST"],
envVars["OLLAMA_CONTEXT_LENGTH"],
envVars["OLLAMA_KEEP_ALIVE"],
envVars["OLLAMA_MAX_LOADED_MODELS"],
envVars["OLLAMA_MAX_QUEUE"],
@@ -1595,6 +1826,8 @@ func NewCLI() *cobra.Command {
stopCmd,
pullCmd,
pushCmd,
signinCmd,
signoutCmd,
listCmd,
psCmd,
copyCmd,
@@ -1613,7 +1846,7 @@ func NewCLI() *cobra.Command {
// to false).
//
// If capabilities are not provided, we fetch them from the server.
func inferThinkingOption(caps *[]model.Capability, runOpts *runOptions, explicitlySetByUser bool) (*bool, error) {
func inferThinkingOption(caps *[]model.Capability, runOpts *runOptions, explicitlySetByUser bool) (*api.ThinkValue, error) {
if explicitlySetByUser {
return runOpts.Think, nil
}
@@ -1640,9 +1873,34 @@ func inferThinkingOption(caps *[]model.Capability, runOpts *runOptions, explicit
}
if thinkingSupported {
thinking := true
return &thinking, nil
return &api.ThinkValue{Value: true}, nil
}
return nil, nil
}
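With `--think` now a string flag, the value can be a boolean or one of the levels named in the help text (`high`, `medium`, `low`). How the CLI actually converts the flag string is not shown in this part of the diff, so the following is only a sketch under assumptions: `thinkValue` is a hypothetical stand-in for `api.ThinkValue`, and `parseThink` is an illustrative helper, not a function from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// thinkValue is a hypothetical stand-in for api.ThinkValue: Value holds
// either a bool or a level string.
type thinkValue struct {
	Value any
}

// parseThink (illustrative helper) maps a flag string to a thinkValue.
// The accepted set mirrors the flag's help text; anything else is an error.
func parseThink(s string) (*thinkValue, error) {
	switch v := strings.ToLower(strings.TrimSpace(s)); v {
	case "true":
		return &thinkValue{Value: true}, nil
	case "false":
		return &thinkValue{Value: false}, nil
	case "high", "medium", "low":
		return &thinkValue{Value: v}, nil
	default:
		return nil, fmt.Errorf("invalid think value %q: want true, false, high, medium, or low", s)
	}
}

func main() {
	for _, in := range []string{"true", "HIGH", "maybe"} {
		tv, err := parseThink(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %#v\n", in, tv.Value)
	}
}
```

Setting `NoOptDefVal = "true"` on the flag, as the diff does, means a bare `--think` reaches a parser like this as the string `"true"`.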
func renderToolCalls(toolCalls []api.ToolCall, plainText bool) string {
out := ""
formatExplanation := ""
formatValues := ""
if !plainText {
formatExplanation = readline.ColorGrey + readline.ColorBold
formatValues = readline.ColorDefault
out += formatExplanation
}
for i, toolCall := range toolCalls {
argsAsJSON, err := json.Marshal(toolCall.Function.Arguments)
if err != nil {
return ""
}
if i > 0 {
out += "\n"
}
// all tool calls are unexpected since we don't currently support registering any in the CLI
out += fmt.Sprintf(" Model called a non-existent function '%s()' with arguments: %s", formatValues+toolCall.Function.Name+formatExplanation, formatValues+string(argsAsJSON)+formatExplanation)
}
if !plainText {
out += readline.ColorDefault
}
return out
}


@@ -3,10 +3,12 @@ package cmd
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"net/http/httptest"
"os"
"reflect"
"strings"
"testing"
"time"
@@ -304,6 +306,8 @@ func TestDeleteHandler(t *testing.T) {
w.WriteHeader(http.StatusOK)
} else {
w.WriteHeader(http.StatusNotFound)
errPayload := `{"error":"model '%s' not found"}`
w.Write([]byte(fmt.Sprintf(errPayload, req.Name)))
}
return
}
@@ -346,7 +350,7 @@ func TestDeleteHandler(t *testing.T) {
}
err := DeleteHandler(cmd, []string{"test-model-not-found"})
if err == nil || !strings.Contains(err.Error(), "unable to stop existing running model \"test-model-not-found\"") {
if err == nil || !strings.Contains(err.Error(), "model 'test-model-not-found' not found") {
t.Fatalf("DeleteHandler failed: expected error about stopping non-existent model, got %v", err)
}
}
@@ -488,9 +492,35 @@ func TestPushHandler(t *testing.T) {
w.(http.Flusher).Flush()
}
},
"/api/me": func(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
t.Errorf("expected POST request, got %s", r.Method)
}
},
},
expectedOutput: "\nYou can find your model at:\n\n\thttps://ollama.com/test-model\n",
},
{
name: "not signed in push",
modelName: "notsignedin-model",
serverResponse: map[string]func(w http.ResponseWriter, r *http.Request){
"/api/me": func(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
t.Errorf("expected POST request, got %s", r.Method)
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusUnauthorized)
err := json.NewEncoder(w).Encode(map[string]string{
"error": "unauthorized",
"signin_url": "https://somethingsomething",
})
if err != nil {
t.Fatal(err)
}
},
},
expectedOutput: "You need to be signed in to push",
},
{
name: "unauthorized push",
modelName: "unauthorized-model",
@@ -499,12 +529,17 @@ func TestPushHandler(t *testing.T) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusUnauthorized)
err := json.NewEncoder(w).Encode(map[string]string{
"error": "access denied",
"error": "403: {\"errors\":[{\"code\":\"ACCESS DENIED\", \"message\":\"access denied\"}]}",
})
if err != nil {
t.Fatal(err)
}
},
"/api/me": func(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
t.Errorf("expected POST request, got %s", r.Method)
}
},
},
expectedError: "you are not authorized to push to this namespace, create the model under a namespace you own",
},
@@ -522,6 +557,10 @@ func TestPushHandler(t *testing.T) {
defer mockServer.Close()
t.Setenv("OLLAMA_HOST", mockServer.URL)
tmpDir := t.TempDir()
t.Setenv("HOME", tmpDir)
t.Setenv("USERPROFILE", tmpDir)
initializeKeypair()
cmd := &cobra.Command{}
cmd.Flags().Bool("insecure", false, "")
@@ -557,7 +596,7 @@ func TestPushHandler(t *testing.T) {
t.Errorf("expected no error, got %v", err)
}
if tt.expectedOutput != "" {
if got := string(stdout); got != tt.expectedOutput {
if got := string(stdout); !strings.Contains(got, tt.expectedOutput) {
t.Errorf("expected output %q, got %q", tt.expectedOutput, got)
}
}
@@ -915,3 +954,286 @@ func TestNewCreateRequest(t *testing.T) {
})
}
}
func TestRunOptions_Copy(t *testing.T) {
// Setup test data
originalKeepAlive := &api.Duration{Duration: 5 * time.Minute}
originalThink := &api.ThinkValue{Value: "test reasoning"}
original := runOptions{
Model: "test-model",
ParentModel: "parent-model",
Prompt: "test prompt",
Messages: []api.Message{
{Role: "user", Content: "hello"},
{Role: "assistant", Content: "hi there"},
},
WordWrap: true,
Format: "json",
System: "system prompt",
Images: []api.ImageData{
[]byte("image1"),
[]byte("image2"),
},
Options: map[string]any{
"temperature": 0.7,
"max_tokens": 1000,
"top_p": 0.9,
},
MultiModal: true,
KeepAlive: originalKeepAlive,
Think: originalThink,
HideThinking: false,
ShowConnect: true,
}
// Test the copy
copied := original.Copy()
// Test 1: Verify the copy is not the same instance
if &copied == &original {
t.Error("Copy should return a different instance")
}
// Test 2: Verify all fields are copied correctly
tests := []struct {
name string
got interface{}
want interface{}
}{
{"Model", copied.Model, original.Model},
{"ParentModel", copied.ParentModel, original.ParentModel},
{"Prompt", copied.Prompt, original.Prompt},
{"WordWrap", copied.WordWrap, original.WordWrap},
{"Format", copied.Format, original.Format},
{"System", copied.System, original.System},
{"MultiModal", copied.MultiModal, original.MultiModal},
{"HideThinking", copied.HideThinking, original.HideThinking},
{"ShowConnect", copied.ShowConnect, original.ShowConnect},
}
for _, tt := range tests {
if !reflect.DeepEqual(tt.got, tt.want) {
t.Errorf("%s mismatch: got %v, want %v", tt.name, tt.got, tt.want)
}
}
// Test 3: Verify Messages slice is deeply copied
if len(copied.Messages) != len(original.Messages) {
t.Errorf("Messages length mismatch: got %d, want %d", len(copied.Messages), len(original.Messages))
}
if len(copied.Messages) > 0 && &copied.Messages[0] == &original.Messages[0] {
t.Error("Messages should be different instances")
}
// Modify original to verify independence
if len(original.Messages) > 0 {
originalContent := original.Messages[0].Content
original.Messages[0].Content = "modified"
if len(copied.Messages) > 0 && copied.Messages[0].Content == "modified" {
t.Error("Messages should be independent after copy")
}
// Restore for other tests
original.Messages[0].Content = originalContent
}
// Test 4: Verify Images slice is deeply copied
if len(copied.Images) != len(original.Images) {
t.Errorf("Images length mismatch: got %d, want %d", len(copied.Images), len(original.Images))
}
if len(copied.Images) > 0 && &copied.Images[0] == &original.Images[0] {
t.Error("Images should be different instances")
}
// Modify original to verify independence
if len(original.Images) > 0 {
originalImage := original.Images[0]
original.Images[0] = []byte("modified")
if len(copied.Images) > 0 && string(copied.Images[0]) == "modified" {
t.Error("Images should be independent after copy")
}
// Restore for other tests
original.Images[0] = originalImage
}
// Test 5: Verify Options map is deeply copied
if len(copied.Options) != len(original.Options) {
t.Errorf("Options length mismatch: got %d, want %d", len(copied.Options), len(original.Options))
}
if len(copied.Options) > 0 && &copied.Options == &original.Options {
t.Error("Options map should be different instances")
}
// Modify original to verify independence
if len(original.Options) > 0 {
originalTemp := original.Options["temperature"]
original.Options["temperature"] = 0.9
if copied.Options["temperature"] == 0.9 {
t.Error("Options should be independent after copy")
}
// Restore for other tests
original.Options["temperature"] = originalTemp
}
// Test 6: Verify KeepAlive pointer is copied (shallow copy)
if copied.KeepAlive != original.KeepAlive {
t.Error("KeepAlive pointer should be the same (shallow copy)")
}
// Test 7: Verify Think pointer creates a new instance
if original.Think != nil && copied.Think == original.Think {
t.Error("Think should be a different instance")
}
if original.Think != nil && copied.Think != nil {
if !reflect.DeepEqual(copied.Think.Value, original.Think.Value) {
t.Errorf("Think.Value mismatch: got %v, want %v", copied.Think.Value, original.Think.Value)
}
}
// Test 8: Test with zero values
zeroOriginal := runOptions{}
zeroCopy := zeroOriginal.Copy()
if !reflect.DeepEqual(zeroCopy, zeroOriginal) {
fmt.Printf("orig: %#v\ncopy: %#v\n", zeroOriginal, zeroCopy)
t.Error("Copy of zero value should equal original zero value")
}
}
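One detail of the test above: `&copied.Options == &original.Options` compares the addresses of two fields in two different struct variables, so it is false whether or not the underlying map is shared (the same holds for `&copied == &original`). The mutation checks that follow are what actually prove independence. If an explicit identity check is wanted, one possible approach is `reflect.Value.Pointer`, sketched below; `sameMap` is a hypothetical helper, not part of the test suite.

```go
package main

import (
	"fmt"
	"reflect"
)

// sameMap (hypothetical helper) reports whether a and b refer to the same
// underlying map. Two nil maps also compare as the same (both report 0).
func sameMap(a, b map[string]any) bool {
	return reflect.ValueOf(a).Pointer() == reflect.ValueOf(b).Pointer()
}

func main() {
	orig := map[string]any{"temperature": 0.7}

	alias := orig // shares storage with orig

	clone := make(map[string]any, len(orig)) // independent storage
	for k, v := range orig {
		clone[k] = v
	}

	fmt.Println(sameMap(orig, alias), sameMap(orig, clone)) // prints "true false"
}
```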
func TestRunOptions_Copy_EmptySlicesAndMaps(t *testing.T) {
// Test with empty slices and maps
original := runOptions{
Messages: []api.Message{},
Images: []api.ImageData{},
Options: map[string]any{},
}
copied := original.Copy()
if copied.Messages == nil {
t.Error("Empty Messages slice should remain empty, not nil")
}
if copied.Images == nil {
t.Error("Empty Images slice should remain empty, not nil")
}
if copied.Options == nil {
t.Error("Empty Options map should remain empty, not nil")
}
if len(copied.Messages) != 0 {
t.Error("Empty Messages slice should remain empty")
}
if len(copied.Images) != 0 {
t.Error("Empty Images slice should remain empty")
}
if len(copied.Options) != 0 {
t.Error("Empty Options map should remain empty")
}
}
func TestRunOptions_Copy_NilPointers(t *testing.T) {
// Test with nil pointers
original := runOptions{
KeepAlive: nil,
Think: nil,
}
copied := original.Copy()
if copied.KeepAlive != nil {
t.Error("Nil KeepAlive should remain nil")
}
if copied.Think != nil {
t.Error("Nil Think should remain nil")
}
}
func TestRunOptions_Copy_ThinkValueVariants(t *testing.T) {
tests := []struct {
name string
think *api.ThinkValue
}{
{"nil Think", nil},
{"bool true", &api.ThinkValue{Value: true}},
{"bool false", &api.ThinkValue{Value: false}},
{"string value", &api.ThinkValue{Value: "reasoning text"}},
{"int value", &api.ThinkValue{Value: 42}},
{"nil value", &api.ThinkValue{Value: nil}},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
original := runOptions{Think: tt.think}
copied := original.Copy()
if tt.think == nil {
if copied.Think != nil {
t.Error("Nil Think should remain nil")
}
return
}
if copied.Think == nil {
t.Error("Non-nil Think should not become nil")
return
}
if copied.Think == original.Think {
t.Error("Think should be a different instance")
}
if !reflect.DeepEqual(copied.Think.Value, original.Think.Value) {
t.Errorf("Think.Value mismatch: got %v, want %v", copied.Think.Value, original.Think.Value)
}
})
}
}
func TestRunOptions_Copy_Independence(t *testing.T) {
// Test that modifications to original don't affect copy
originalThink := &api.ThinkValue{Value: "original"}
original := runOptions{
Model: "original-model",
Messages: []api.Message{{Role: "user", Content: "original"}},
Options: map[string]any{"key": "value"},
Think: originalThink,
}
copied := original.Copy()
// Modify original
original.Model = "modified-model"
if len(original.Messages) > 0 {
original.Messages[0].Content = "modified"
}
original.Options["key"] = "modified"
if original.Think != nil {
original.Think.Value = "modified"
}
// Verify copy is unchanged
if copied.Model == "modified-model" {
t.Error("Copy Model should not be affected by original modification")
}
if len(copied.Messages) > 0 && copied.Messages[0].Content == "modified" {
t.Error("Copy Messages should not be affected by original modification")
}
if copied.Options["key"] == "modified" {
t.Error("Copy Options should not be affected by original modification")
}
if copied.Think != nil && copied.Think.Value == "modified" {
t.Error("Copy Think should not be affected by original modification")
}
}


@@ -195,16 +195,24 @@ func generateInteractive(cmd *cobra.Command, opts runOptions) error {
fmt.Println("Usage:\n /load <modelname>")
continue
}
origOpts := opts.Copy()
opts.Model = args[1]
opts.Messages = []api.Message{}
fmt.Printf("Loading model '%s'\n", opts.Model)
opts.Think, err = inferThinkingOption(nil, &opts, thinkExplicitlySet)
if err != nil {
if strings.Contains(err.Error(), "not found") {
fmt.Printf("Couldn't find model '%s'\n", opts.Model)
opts = origOpts.Copy()
continue
}
return err
}
if err := loadOrUnloadModel(cmd, &opts); err != nil {
if strings.Contains(err.Error(), "not found") {
fmt.Printf("error: %v\n", err)
fmt.Printf("Couldn't find model '%s'\n", opts.Model)
opts = origOpts.Copy()
continue
}
if strings.Contains(err.Error(), "does not support thinking") {
@@ -272,16 +280,29 @@ func generateInteractive(cmd *cobra.Command, opts runOptions) error {
}
fmt.Println("Set 'quiet' mode.")
case "think":
think := true
opts.Think = &think
thinkValue := api.ThinkValue{Value: true}
var maybeLevel string
if len(args) > 2 {
maybeLevel = args[2]
}
if maybeLevel != "" {
// TODO(drifkin): validate the level, could be model dependent
// though... It will also be validated on the server once a call is
// made.
thinkValue.Value = maybeLevel
}
opts.Think = &thinkValue
thinkExplicitlySet = true
if client, err := api.ClientFromEnvironment(); err == nil {
ensureThinkingSupport(cmd.Context(), client, opts.Model)
}
fmt.Println("Set 'think' mode.")
if maybeLevel != "" {
fmt.Printf("Set 'think' mode to '%s'.\n", maybeLevel)
} else {
fmt.Println("Set 'think' mode.")
}
case "nothink":
think := false
opts.Think = &think
opts.Think = &api.ThinkValue{Value: false}
thinkExplicitlySet = true
if client, err := api.ClientFromEnvironment(); err == nil {
ensureThinkingSupport(cmd.Context(), client, opts.Model)
@@ -478,7 +499,8 @@ func generateInteractive(cmd *cobra.Command, opts runOptions) error {
assistant, err := chat(cmd, opts)
if err != nil {
if strings.Contains(err.Error(), "does not support thinking") {
if strings.Contains(err.Error(), "does not support thinking") ||
strings.Contains(err.Error(), "invalid think value") {
fmt.Printf("error: %v\n", err)
sb.Reset()
continue


@@ -202,6 +202,8 @@ func ConvertModel(fsys fs.FS, f *os.File) error {
conv = &bertModel{}
case "CohereForCausalLM":
conv = &commandrModel{}
case "GptOssForCausalLM":
conv = &gptossModel{}
default:
return fmt.Errorf("unsupported architecture %q", p.Architectures[0])
}


@@ -28,6 +28,7 @@ type bertModel struct {
LayerNormEPS float32 `json:"layer_norm_eps"`
LayerNormEpsilon float32 `json:"layer_norm_epsilon"`
NormEpsilon float32 `json:"norm_epsilon"`
normalizeEmbeddings bool
PoolingType uint32
}
@@ -54,9 +55,11 @@ func (p *bertModel) parseMore(fsys fs.FS) error {
var pooling string
for _, m := range modules {
if m.Type == "sentence_transformers.models.Pooling" {
switch m.Type {
case "sentence_transformers.models.Pooling":
pooling = m.Path
break
case "sentence_transformers.models.Normalize":
p.normalizeEmbeddings = true
}
}
@@ -90,6 +93,7 @@ func (p *bertModel) KV(t *Tokenizer) ggml.KV {
kv["general.architecture"] = "bert"
kv["bert.attention.causal"] = false
kv["bert.pooling_type"] = p.PoolingType
kv["bert.normalize_embeddings"] = p.normalizeEmbeddings
kv["bert.block_count"] = cmp.Or(p.NLayers, p.NumHiddenLayers, p.NLayer)

convert/convert_gptoss.go (new file, 266 lines)

@@ -0,0 +1,266 @@
package convert
import (
"bytes"
"cmp"
"encoding/binary"
"io"
"slices"
"strings"
"github.com/ollama/ollama/fs/ggml"
"github.com/pdevine/tensor"
"github.com/pdevine/tensor/native"
)
type gptossModel struct {
ModelParameters
HiddenLayers uint32 `json:"num_hidden_layers"`
MaxPositionEmbeddings uint32 `json:"max_position_embeddings"`
HiddenSize uint32 `json:"hidden_size"`
IntermediateSize uint32 `json:"intermediate_size"`
AttentionHeads uint32 `json:"num_attention_heads"`
KeyValueHeads uint32 `json:"num_key_value_heads"`
HeadDim uint32 `json:"head_dim"`
Experts uint32 `json:"num_experts"`
LocalExperts uint32 `json:"num_local_experts"`
ExpertsPerToken uint32 `json:"experts_per_token"`
RMSNormEpsilon float32 `json:"rms_norm_eps"`
InitialContextLength uint32 `json:"initial_context_length"`
RopeTheta float32 `json:"rope_theta"`
RopeScalingFactor float32 `json:"rope_scaling_factor"`
RopeScaling struct {
Factor float32 `json:"factor"`
} `json:"rope_scaling"`
SlidingWindow uint32 `json:"sliding_window"`
}
var _ ModelConverter = (*gptossModel)(nil)
func (m *gptossModel) KV(t *Tokenizer) ggml.KV {
kv := m.ModelParameters.KV(t)
kv["general.architecture"] = "gptoss"
kv["general.file_type"] = uint32(4)
kv["gptoss.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(m.RopeScalingFactor*float32(m.InitialContextLength)))
kv["gptoss.block_count"] = m.HiddenLayers
kv["gptoss.embedding_length"] = m.HiddenSize
kv["gptoss.feed_forward_length"] = m.IntermediateSize
kv["gptoss.expert_count"] = cmp.Or(m.Experts, m.LocalExperts)
kv["gptoss.expert_used_count"] = m.ExpertsPerToken
kv["gptoss.attention.head_count"] = m.AttentionHeads
kv["gptoss.attention.head_count_kv"] = m.KeyValueHeads
kv["gptoss.attention.key_length"] = m.HeadDim
kv["gptoss.attention.value_length"] = m.HeadDim
kv["gptoss.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEpsilon, 1e-5)
kv["gptoss.attention.sliding_window"] = m.SlidingWindow
kv["gptoss.rope.freq_base"] = m.RopeTheta
kv["gptoss.rope.scaling.factor"] = cmp.Or(m.RopeScalingFactor, m.RopeScaling.Factor)
kv["gptoss.rope.scaling.original_context_length"] = m.InitialContextLength
kv["tokenizer.ggml.bos_token_id"] = uint32(199998) // <|startoftext|>
kv["tokenizer.ggml.add_bos_token"] = false
kv["tokenizer.ggml.eos_token_id"] = uint32(199999) // <|endoftext|>
kv["tokenizer.ggml.eos_token_ids"] = []int32{
199999, /* <|endoftext|> */
200002, /* <|return|> */
200012, /* <|call|> */
}
kv["tokenizer.ggml.add_eos_token"] = false
return kv
}
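The `KV` method leans on `cmp.Or` (Go 1.22+), which returns its first argument that is not the zero value, to pick between the two config flavors. A minimal sketch of the context-length fallback follows; `contextLength` is a hypothetical helper, and the numbers in `main` are illustrative rather than taken from any particular model config.

```go
package main

import (
	"cmp"
	"fmt"
)

// contextLength (hypothetical helper) mirrors the fallback in KV: prefer
// max_position_embeddings when it is set, otherwise derive the length from
// the rope scaling factor and the initial context length.
func contextLength(maxPositionEmbeddings uint32, ropeScalingFactor float32, initialContextLength uint32) uint32 {
	return cmp.Or(
		maxPositionEmbeddings,
		uint32(ropeScalingFactor*float32(initialContextLength)),
	)
}

func main() {
	// Config that sets max_position_embeddings directly.
	fmt.Println(contextLength(131072, 0, 0)) // prints 131072

	// Config that only provides a scaling factor and an initial length.
	fmt.Println(contextLength(0, 32, 4096)) // prints 131072
}
```

Note that `cmp.Or` evaluates all of its arguments before choosing, so the fallback expression is computed even when the first value is non-zero.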
func (m *gptossModel) Tensors(ts []Tensor) []*ggml.Tensor {
var out []*ggml.Tensor
mxfp4s := make(map[string]*mxfp4)
for _, t := range ts {
if strings.HasSuffix(t.Name(), ".blocks") || strings.HasSuffix(t.Name(), ".scales") {
dot := strings.LastIndex(t.Name(), ".")
name, suffix := t.Name()[:dot], t.Name()[dot+1:]
if _, ok := mxfp4s[name]; !ok {
mxfp4s[name] = &mxfp4{}
}
switch suffix {
case "blocks":
mxfp4s[name].blocks = t
case "scales":
mxfp4s[name].scales = t
}
} else if strings.HasSuffix(t.Name(), "gate_up_exps.bias") {
// gate_up_exps is interleaved, need to split into gate_exps and up_exps
// e.g. gate_exps, up_exps = gate_up_exps[:, 0::2, ...], gate_up_exps[:, 1::2, ...]
out = append(out, slices.Collect(splitDim(t, 1,
split{
Replacer: strings.NewReplacer("gate_up_exps", "gate_exps"),
slices: []tensor.Slice{nil, tensor.S(0, int(t.Shape()[1]), 2)},
},
split{
Replacer: strings.NewReplacer("gate_up_exps", "up_exps"),
slices: []tensor.Slice{nil, tensor.S(1, int(t.Shape()[1]), 2)},
},
))...)
} else {
out = append(out, &ggml.Tensor{
Name: t.Name(),
Kind: t.Kind(),
Shape: t.Shape(),
WriterTo: t,
})
}
}
for name, mxfp4 := range mxfp4s {
dims := mxfp4.blocks.Shape()
if strings.Contains(name, "ffn_down_exps") {
out = append(out, &ggml.Tensor{
Name: name + ".weight",
Kind: uint32(ggml.TensorTypeMXFP4),
Shape: []uint64{dims[0], dims[1], dims[2] * dims[3] * 2},
WriterTo: mxfp4,
})
} else if strings.Contains(name, "ffn_gate_up_exps") {
// gate_up_exps is interleaved, need to split into gate_exps and up_exps
// e.g. gate_exps, up_exps = gate_up_exps[:, 0::2, ...], gate_up_exps[:, 1::2, ...]
out = append(out, &ggml.Tensor{
Name: strings.Replace(name, "gate_up", "gate", 1) + ".weight",
Kind: uint32(ggml.TensorTypeMXFP4),
Shape: []uint64{dims[0], dims[1] / 2, dims[2] * dims[3] * 2},
WriterTo: mxfp4.slice(1, 0, int(dims[1]), 2),
}, &ggml.Tensor{
Name: strings.Replace(name, "gate_up", "up", 1) + ".weight",
Kind: uint32(ggml.TensorTypeMXFP4),
Shape: []uint64{dims[0], dims[1] / 2, dims[2] * dims[3] * 2},
WriterTo: mxfp4.slice(1, 1, int(dims[1]), 2),
})
}
}
return out
}
func (m *gptossModel) Replacements() []string {
var replacements []string
if m.MaxPositionEmbeddings > 0 {
// hf flavored model
replacements = []string{
"lm_head", "output",
"model.embed_tokens", "token_embd",
"model.layers", "blk",
"input_layernorm", "attn_norm",
"self_attn.q_proj", "attn_q",
"self_attn.k_proj", "attn_k",
"self_attn.v_proj", "attn_v",
"self_attn.o_proj", "attn_out",
"self_attn.sinks", "attn_sinks",
"post_attention_layernorm", "ffn_norm",
"mlp.router", "ffn_gate_inp",
"mlp.experts.gate_up_proj_", "ffn_gate_up_exps.",
"mlp.experts.down_proj_", "ffn_down_exps.",
"model.norm", "output_norm",
}
} else {
replacements = []string{
// noop replacements so other replacements will not be applied
".blocks", ".blocks",
".scales", ".scales",
// real replacements
"block", "blk",
"attn.norm", "attn_norm",
"attn.qkv", "attn_qkv",
"attn.sinks", "attn_sinks",
"attn.out", "attn_out",
"mlp.norm", "ffn_norm",
"mlp.gate", "ffn_gate_inp",
"mlp.mlp1_", "ffn_gate_up_exps.",
"mlp.mlp2_", "ffn_down_exps.",
"embedding", "token_embd",
"norm", "output_norm",
"unembedding", "output",
"scale", "weight",
}
}
return replacements
}
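The "noop replacements" in the second list make sense if the pairs end up in `strings.NewReplacer`, which is an assumption here since the call site is not part of this hunk. A `Replacer` scans the input left to right and never revisits text it has already matched, so a pair that maps `.scales` to itself consumes the suffix before the later `scale` to `weight` pair can touch it. A minimal sketch with a reduced pair list follows; the tensor names are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

var (
	// guarded lists the identity pair first, as the converter does.
	guarded = strings.NewReplacer(
		".scales", ".scales", // noop: shields the MXFP4 ".scales" suffix
		"scale", "weight", // real replacement
	)

	// naive omits the identity pair.
	naive = strings.NewReplacer("scale", "weight")
)

func main() {
	// Illustrative tensor names, not taken from a real checkpoint.
	fmt.Println(guarded.Replace("blk.0.ffn_down_exps.scales")) // suffix kept
	fmt.Println(naive.Replace("blk.0.ffn_down_exps.scales"))   // suffix mangled to ".weights"
	fmt.Println(guarded.Replace("blk.0.attn_norm.scale"))      // still becomes ".weight"
}
```

Keeping the `.blocks` and `.scales` suffixes intact matters because `Tensors` later pairs tensors up by exactly those suffixes.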
type mxfp4 struct {
slices []tensor.Slice
blocks, scales Tensor
}
func (m *mxfp4) slice(dim, start, end, step int) *mxfp4 {
slice := slices.Repeat([]tensor.Slice{nil}, len(m.blocks.Shape()))
slice[dim] = tensor.S(start, end, step)
return &mxfp4{
slices: slice,
blocks: m.blocks,
scales: m.scales,
}
}
func (m *mxfp4) WriteTo(w io.Writer) (int64, error) {
var b bytes.Buffer
if _, err := m.blocks.WriteTo(&b); err != nil {
return 0, err
}
blocksDims := make([]int, len(m.blocks.Shape()))
for i, d := range m.blocks.Shape() {
blocksDims[i] = int(d)
}
bts := b.Bytes()
var tmp [16]byte
for i := 0; i < b.Len(); i += 16 {
for j := range 8 {
// transform a1b2c3 ... x7y8z9 -> 71xa82yb93zc
a, b := bts[i+j], bts[i+j+8]
tmp[2*j+0] = (a & 0x0F) | (b << 4)
tmp[2*j+1] = (a >> 4) | (b & 0xF0)
}
copy(bts[i:i+16], tmp[:])
}
var blocks tensor.Tensor = tensor.New(tensor.WithShape(blocksDims...), tensor.WithBacking(bts))
var s bytes.Buffer
if _, err := m.scales.WriteTo(&s); err != nil {
return 0, err
}
scalesDims := slices.Repeat([]int{1}, len(m.blocks.Shape()))
for i, d := range m.scales.Shape() {
scalesDims[i] = int(d)
}
var scales tensor.Tensor = tensor.New(tensor.WithShape(scalesDims...), tensor.WithBacking(s.Bytes()))
out, err := tensor.Concat(3, scales, blocks)
if err != nil {
return 0, err
}
if len(m.slices) > 0 {
out, err = out.Slice(m.slices...)
if err != nil {
return 0, err
}
}
out = tensor.Materialize(out)
if err := out.Reshape(out.Shape().TotalSize()); err != nil {
return 0, err
}
u8s, err := native.VectorU8(out.(*tensor.Dense))
if err != nil {
return 0, err
}
if err := binary.Write(w, binary.LittleEndian, u8s); err != nil {
return 0, err
}
return int64(len(u8s)), nil
}
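
The nibble shuffle inside `mxfp4.WriteTo` can be tried in isolation. The sketch below is a standalone illustration and not part of this change: `interleave` is a hypothetical helper name, and it assumes a single 16-byte block in which byte j is paired with byte j+8, as in the loop above.

```go
package main

import "fmt"

// interleave rewrites one 16-byte block in place. For each pair (a, b) taken
// from positions j and j+8, the first output byte holds the two low nibbles
// and the second output byte holds the two high nibbles.
// The name and signature are assumptions made for this example.
func interleave(block []byte) {
	var tmp [16]byte
	for j := 0; j < 8; j++ {
		a, b := block[j], block[j+8]
		tmp[2*j+0] = (a & 0x0F) | (b << 4) // low nibble of a, low nibble of b
		tmp[2*j+1] = (a >> 4) | (b & 0xF0) // high nibble of a, high nibble of b
	}
	copy(block, tmp[:])
}

func main() {
	block := make([]byte, 16)
	block[0], block[8] = 0xAB, 0xCD
	interleave(block)
	fmt.Printf("% x\n", block[:2]) // prints "db ca"
}
```

Working on a fixed-size scratch array keeps the rewrite allocation-free, which is presumably why the original loop uses `tmp [16]byte` as well.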


@@ -18,6 +18,7 @@ import (
"strings"
"testing"
"github.com/google/go-cmp/cmp"
"github.com/ollama/ollama/fs/ggml"
)
@@ -339,13 +340,8 @@ func TestConvertAdapter(t *testing.T) {
}
actual := generateResultsJSON(t, r, m.KV(), m.Tensors())
for _, k := range slices.Sorted(maps.Keys(c.Expected)) {
if v, ok := actual[k]; !ok {
t.Errorf("missing %s", k)
} else if v != c.Expected[k] {
t.Errorf("unexpected %s: want %s, got %s", k, c.Expected[k], v)
}
if diff := cmp.Diff(c.Expected, actual); diff != "" {
t.Errorf("mismatch (-want +got):\n%s", diff)
}
})
}


@@ -31,28 +31,31 @@ func (t tensorBase) Shape() []uint64 {
}
const (
tensorKindF32 uint32 = iota
tensorKindF16
tensorKindFP32 uint32 = iota
tensorKindFP16
tensorKindBF16 = 30
tensorKindMXFP4 = 39
)
func (t tensorBase) Kind() uint32 {
if strings.HasSuffix(t.name, ".ffn_gate_inp.weight") ||
strings.HasSuffix(t.name, ".bias") ||
t.name == "token_types.weight" ||
t.name == "v.positional_embedding_vlm" ||
t.name == "v.tile_position_embd.weight" ||
t.name == "v.pre_tile_position_embd.weight" ||
t.name == "v.post_tile_position_embd.weight" {
// these tensors are always F32
return 0
return tensorKindFP32
}
switch len(t.shape) {
case 0:
panic("invalid tensor shape")
case 1:
return tensorKindF32
return tensorKindFP32
default:
return tensorKindF16
return tensorKindFP16
}
}


@@ -1,6 +1,7 @@
package convert
import (
"bufio"
"bytes"
"encoding/binary"
"encoding/json"
@@ -93,6 +94,15 @@ type safetensor struct {
*tensorBase
}
func (st safetensor) Kind() uint32 {
kind := st.tensorBase.Kind()
if !strings.HasPrefix(st.name, "v.") && st.dtype == "BF16" && kind != tensorKindFP32 {
kind = tensorKindBF16
}
return kind
}
func (st safetensor) Clone() Tensor {
return &safetensor{
fs: st.fs,
@@ -115,26 +125,41 @@ func (st safetensor) WriteTo(w io.Writer) (int64, error) {
}
defer f.Close()
if seeker, ok := f.(io.Seeker); ok {
if _, err := seeker.Seek(st.offset, io.SeekStart); err != nil {
return 0, err
}
} else {
if _, err := io.CopyN(io.Discard, f, st.offset); err != nil {
return 0, err
r, err := func() (io.Reader, error) {
if readerAt, ok := f.(io.ReaderAt); ok {
return io.NewSectionReader(readerAt, st.offset, st.size), nil
} else if seeker, ok := f.(io.Seeker); ok {
_, err := seeker.Seek(st.offset, io.SeekStart)
return f, err
} else {
_, err := io.CopyN(io.Discard, f, st.offset)
return f, err
}
}()
if err != nil {
return 0, err
}
br := bufio.NewReaderSize(r, min(32<<10, int(st.size)))
// special case when input and output are same type and the
// tensor doesn't need repacking
if (st.repacker == nil) &&
((st.dtype == "F32" && st.Kind() == tensorKindFP32) ||
(st.dtype == "F16" && st.Kind() == tensorKindFP16) ||
(st.dtype == "U8")) {
return io.CopyN(w, br, st.size)
}
var f32s []float32
switch st.dtype {
case "F32":
f32s = make([]float32, st.size/4)
if err = binary.Read(f, binary.LittleEndian, f32s); err != nil {
if err = binary.Read(br, binary.LittleEndian, f32s); err != nil {
return 0, err
}
case "F16":
u16s := make([]uint16, st.size/2)
if err = binary.Read(f, binary.LittleEndian, u16s); err != nil {
if err = binary.Read(br, binary.LittleEndian, u16s); err != nil {
return 0, err
}
@@ -145,7 +170,7 @@ func (st safetensor) WriteTo(w io.Writer) (int64, error) {
case "BF16":
u8s := make([]uint8, st.size)
if err = binary.Read(f, binary.LittleEndian, u8s); err != nil {
if err = binary.Read(br, binary.LittleEndian, u8s); err != nil {
return 0, err
}
@@ -162,15 +187,18 @@ func (st safetensor) WriteTo(w io.Writer) (int64, error) {
}
switch st.Kind() {
case tensorKindF32:
return 0, binary.Write(w, binary.LittleEndian, f32s)
case tensorKindF16:
case tensorKindFP32:
return int64(len(f32s) * 4), binary.Write(w, binary.LittleEndian, f32s)
case tensorKindFP16:
f16s := make([]uint16, len(f32s))
for i := range f32s {
f16s[i] = float16.Fromfloat32(f32s[i]).Bits()
}
return 0, binary.Write(w, binary.LittleEndian, f16s)
return int64(len(f16s) * 2), binary.Write(w, binary.LittleEndian, f16s)
case tensorKindBF16:
u8s := bfloat16.EncodeFloat32(f32s)
return int64(len(u8s)), binary.Write(w, binary.LittleEndian, u8s)
default:
return 0, fmt.Errorf("unknown storage type: %d", st.Kind())
}

convert/reader_test.go (new file, 294 lines)

@@ -0,0 +1,294 @@
package convert
import (
"bytes"
"encoding/binary"
"os"
"path/filepath"
"testing"
"github.com/d4l3k/go-bfloat16"
"github.com/google/go-cmp/cmp"
"github.com/x448/float16"
)
func TestSafetensors(t *testing.T) {
t.Parallel()
root, err := os.OpenRoot(t.TempDir())
if err != nil {
t.Fatal(err)
}
defer root.Close()
cases := []struct {
name,
dtype string
offset,
size int64
shape []uint64
setup func(*testing.T, *os.File)
want []byte
}{
{
name: "fp32-fp32",
dtype: "F32",
size: 32 * 4, // 32 floats, each 4 bytes
shape: []uint64{32},
setup: func(t *testing.T, f *os.File) {
f32s := make([]float32, 32)
for i := range f32s {
f32s[i] = float32(i)
}
if err := binary.Write(f, binary.LittleEndian, f32s); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x3f, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x40, 0x40,
0x00, 0x00, 0x80, 0x40, 0x00, 0x00, 0xa0, 0x40, 0x00, 0x00, 0xc0, 0x40, 0x00, 0x00, 0xe0, 0x40,
0x00, 0x00, 0x00, 0x41, 0x00, 0x00, 0x10, 0x41, 0x00, 0x00, 0x20, 0x41, 0x00, 0x00, 0x30, 0x41,
0x00, 0x00, 0x40, 0x41, 0x00, 0x00, 0x50, 0x41, 0x00, 0x00, 0x60, 0x41, 0x00, 0x00, 0x70, 0x41,
0x00, 0x00, 0x80, 0x41, 0x00, 0x00, 0x88, 0x41, 0x00, 0x00, 0x90, 0x41, 0x00, 0x00, 0x98, 0x41,
0x00, 0x00, 0xa0, 0x41, 0x00, 0x00, 0xa8, 0x41, 0x00, 0x00, 0xb0, 0x41, 0x00, 0x00, 0xb8, 0x41,
0x00, 0x00, 0xc0, 0x41, 0x00, 0x00, 0xc8, 0x41, 0x00, 0x00, 0xd0, 0x41, 0x00, 0x00, 0xd8, 0x41,
0x00, 0x00, 0xe0, 0x41, 0x00, 0x00, 0xe8, 0x41, 0x00, 0x00, 0xf0, 0x41, 0x00, 0x00, 0xf8, 0x41,
},
},
{
name: "fp32-fp16",
dtype: "F32",
size: 32 * 4, // 32 floats, each 4 bytes
shape: []uint64{16, 2},
setup: func(t *testing.T, f *os.File) {
f32s := make([]float32, 32)
for i := range f32s {
f32s[i] = float32(i)
}
if err := binary.Write(f, binary.LittleEndian, f32s); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x00, 0x00, 0x3c, 0x00, 0x40, 0x00, 0x42, 0x00, 0x44, 0x00, 0x45, 0x00, 0x46, 0x00, 0x47,
0x00, 0x48, 0x80, 0x48, 0x00, 0x49, 0x80, 0x49, 0x00, 0x4a, 0x80, 0x4a, 0x00, 0x4b, 0x80, 0x4b,
0x00, 0x4c, 0x40, 0x4c, 0x80, 0x4c, 0xc0, 0x4c, 0x00, 0x4d, 0x40, 0x4d, 0x80, 0x4d, 0xc0, 0x4d,
0x00, 0x4e, 0x40, 0x4e, 0x80, 0x4e, 0xc0, 0x4e, 0x00, 0x4f, 0x40, 0x4f, 0x80, 0x4f, 0xc0, 0x4f,
},
},
{
name: "fp16-fp16",
dtype: "F16",
size: 32 * 2, // 32 floats, each 2 bytes
shape: []uint64{16, 2},
setup: func(t *testing.T, f *os.File) {
u16s := make([]uint16, 32)
for i := range u16s {
u16s[i] = float16.Fromfloat32(float32(i)).Bits()
}
if err := binary.Write(f, binary.LittleEndian, u16s); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x00, 0x00, 0x3c, 0x00, 0x40, 0x00, 0x42, 0x00, 0x44, 0x00, 0x45, 0x00, 0x46, 0x00, 0x47,
0x00, 0x48, 0x80, 0x48, 0x00, 0x49, 0x80, 0x49, 0x00, 0x4a, 0x80, 0x4a, 0x00, 0x4b, 0x80, 0x4b,
0x00, 0x4c, 0x40, 0x4c, 0x80, 0x4c, 0xc0, 0x4c, 0x00, 0x4d, 0x40, 0x4d, 0x80, 0x4d, 0xc0, 0x4d,
0x00, 0x4e, 0x40, 0x4e, 0x80, 0x4e, 0xc0, 0x4e, 0x00, 0x4f, 0x40, 0x4f, 0x80, 0x4f, 0xc0, 0x4f,
},
},
{
name: "fp16-fp32",
dtype: "F16",
size: 32 * 2, // 32 floats, each 2 bytes
shape: []uint64{32},
setup: func(t *testing.T, f *os.File) {
u16s := make([]uint16, 32)
for i := range u16s {
u16s[i] = float16.Fromfloat32(float32(i)).Bits()
}
if err := binary.Write(f, binary.LittleEndian, u16s); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x3f, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x40, 0x40,
0x00, 0x00, 0x80, 0x40, 0x00, 0x00, 0xa0, 0x40, 0x00, 0x00, 0xc0, 0x40, 0x00, 0x00, 0xe0, 0x40,
0x00, 0x00, 0x00, 0x41, 0x00, 0x00, 0x10, 0x41, 0x00, 0x00, 0x20, 0x41, 0x00, 0x00, 0x30, 0x41,
0x00, 0x00, 0x40, 0x41, 0x00, 0x00, 0x50, 0x41, 0x00, 0x00, 0x60, 0x41, 0x00, 0x00, 0x70, 0x41,
0x00, 0x00, 0x80, 0x41, 0x00, 0x00, 0x88, 0x41, 0x00, 0x00, 0x90, 0x41, 0x00, 0x00, 0x98, 0x41,
0x00, 0x00, 0xa0, 0x41, 0x00, 0x00, 0xa8, 0x41, 0x00, 0x00, 0xb0, 0x41, 0x00, 0x00, 0xb8, 0x41,
0x00, 0x00, 0xc0, 0x41, 0x00, 0x00, 0xc8, 0x41, 0x00, 0x00, 0xd0, 0x41, 0x00, 0x00, 0xd8, 0x41,
0x00, 0x00, 0xe0, 0x41, 0x00, 0x00, 0xe8, 0x41, 0x00, 0x00, 0xf0, 0x41, 0x00, 0x00, 0xf8, 0x41,
},
},
{
name: "bf16-bf16",
dtype: "BF16",
size: 32 * 2, // 32 brain floats, each 2 bytes
shape: []uint64{16, 2},
setup: func(t *testing.T, f *os.File) {
f32s := make([]float32, 32)
for i := range f32s {
f32s[i] = float32(i)
}
if err := binary.Write(f, binary.LittleEndian, bfloat16.EncodeFloat32(f32s)); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x00, 0x80, 0x3f, 0x00, 0x40, 0x40, 0x40, 0x80, 0x40, 0xa0, 0x40, 0xc0, 0x40, 0xe0, 0x40,
0x00, 0x41, 0x10, 0x41, 0x20, 0x41, 0x30, 0x41, 0x40, 0x41, 0x50, 0x41, 0x60, 0x41, 0x70, 0x41,
0x80, 0x41, 0x88, 0x41, 0x90, 0x41, 0x98, 0x41, 0xa0, 0x41, 0xa8, 0x41, 0xb0, 0x41, 0xb8, 0x41,
0xc0, 0x41, 0xc8, 0x41, 0xd0, 0x41, 0xd8, 0x41, 0xe0, 0x41, 0xe8, 0x41, 0xf0, 0x41, 0xf8, 0x41,
},
},
{
name: "bf16-fp32",
dtype: "BF16",
size: 32 * 2, // 32 brain floats, each 2 bytes
shape: []uint64{32},
setup: func(t *testing.T, f *os.File) {
f32s := make([]float32, 32)
for i := range f32s {
f32s[i] = float32(i)
}
if err := binary.Write(f, binary.LittleEndian, bfloat16.EncodeFloat32(f32s)); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x3f, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x40, 0x40,
0x00, 0x00, 0x80, 0x40, 0x00, 0x00, 0xa0, 0x40, 0x00, 0x00, 0xc0, 0x40, 0x00, 0x00, 0xe0, 0x40,
0x00, 0x00, 0x00, 0x41, 0x00, 0x00, 0x10, 0x41, 0x00, 0x00, 0x20, 0x41, 0x00, 0x00, 0x30, 0x41,
0x00, 0x00, 0x40, 0x41, 0x00, 0x00, 0x50, 0x41, 0x00, 0x00, 0x60, 0x41, 0x00, 0x00, 0x70, 0x41,
0x00, 0x00, 0x80, 0x41, 0x00, 0x00, 0x88, 0x41, 0x00, 0x00, 0x90, 0x41, 0x00, 0x00, 0x98, 0x41,
0x00, 0x00, 0xa0, 0x41, 0x00, 0x00, 0xa8, 0x41, 0x00, 0x00, 0xb0, 0x41, 0x00, 0x00, 0xb8, 0x41,
0x00, 0x00, 0xc0, 0x41, 0x00, 0x00, 0xc8, 0x41, 0x00, 0x00, 0xd0, 0x41, 0x00, 0x00, 0xd8, 0x41,
0x00, 0x00, 0xe0, 0x41, 0x00, 0x00, 0xe8, 0x41, 0x00, 0x00, 0xf0, 0x41, 0x00, 0x00, 0xf8, 0x41,
},
},
{
name: "u8-u8",
dtype: "U8",
size: 32, // 32 unsigned bytes, each 1 byte
shape: []uint64{32},
setup: func(t *testing.T, f *os.File) {
u8s := make([]uint8, 32)
for i := range u8s {
u8s[i] = uint8(i)
}
if err := binary.Write(f, binary.LittleEndian, u8s); err != nil {
t.Fatal(err)
}
},
want: []byte{
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,
},
},
}
for _, tt := range cases {
t.Run(tt.name, func(t *testing.T) {
path := filepath.Base(t.Name())
st := safetensor{
fs: root.FS(),
path: path,
dtype: tt.dtype,
offset: tt.offset,
size: tt.size,
tensorBase: &tensorBase{
name: tt.name,
shape: tt.shape,
},
}
f, err := root.Create(path)
if err != nil {
t.Fatal(err)
}
defer f.Close()
tt.setup(t, f)
var b bytes.Buffer
if _, err := st.WriteTo(&b); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(tt.want, b.Bytes()); diff != "" {
t.Errorf("safetensor.WriteTo() mismatch (-want +got):\n%s", diff)
}
})
}
}
func TestSafetensorKind(t *testing.T) {
tests := []struct {
name string
st safetensor
expected uint32
}{
{
name: "BF16 dtype with non-v. prefix and non-FP32 base kind should return BF16",
st: safetensor{
tensorBase: &tensorBase{
name: "weight.matrix",
shape: []uint64{10, 10}, // will default to FP16
},
dtype: "BF16",
},
expected: tensorKindBF16,
},
{
name: "BF16 dtype with v. prefix should return base kind",
st: safetensor{
tensorBase: &tensorBase{
name: "v.weight.matrix",
shape: []uint64{10, 10}, // will default to FP16
},
dtype: "BF16",
},
expected: tensorKindFP16,
},
{
name: "BF16 dtype with FP32 base kind should return FP32",
st: safetensor{
tensorBase: &tensorBase{
name: "weight.matrix",
shape: []uint64{10}, // will default to FP32
},
dtype: "BF16",
},
expected: tensorKindFP32,
},
{
name: "Non-BF16 dtype should return base kind",
st: safetensor{
tensorBase: &tensorBase{
name: "weight.matrix",
shape: []uint64{10, 10}, // will default to FP16
},
dtype: "F16",
},
expected: tensorKindFP16,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result := tt.st.Kind()
if result != tt.expected {
t.Errorf("Kind() = %d, expected %d", result, tt.expected)
}
})
}
}
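
The four cases in `TestSafetensorKind` reduce to a single rule: a BF16 source tensor stays BF16 unless its name starts with `v.` or its base kind is already FP32. The sketch below restates that rule on its own. `pickKind` is a hypothetical name, the base kind is passed in as a parameter instead of being derived from the tensor name and shape, and the constant values are assumed from the `const` block in this diff (FP32 = 0, FP16 = 1, BF16 = 30).

```go
package main

import (
	"fmt"
	"strings"
)

// Kind values assumed from the const block shown in this diff.
const (
	kindFP32 uint32 = 0
	kindFP16 uint32 = 1
	kindBF16 uint32 = 30
)

// pickKind mirrors the override in safetensor.Kind: keep BF16 only for
// tensors outside the "v." namespace whose base kind is not FP32.
func pickKind(name, dtype string, base uint32) uint32 {
	if !strings.HasPrefix(name, "v.") && dtype == "BF16" && base != kindFP32 {
		return kindBF16
	}
	return base
}

func main() {
	fmt.Println(pickKind("weight.matrix", "BF16", kindFP16))   // prints 30
	fmt.Println(pickKind("v.weight.matrix", "BF16", kindFP16)) // prints 1
	fmt.Println(pickKind("weight.matrix", "BF16", kindFP32))   // prints 0
	fmt.Println(pickKind("weight.matrix", "F16", kindFP16))    // prints 1
}
```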


@@ -16,7 +16,8 @@ import (
type split struct {
*strings.Replacer
dim int
dim int
slices []tensor.Slice
// fn is an optional function to apply to the tensor after slicing
fn func(tensor.Tensor) (tensor.Tensor, error)
@@ -32,9 +33,12 @@ func splitDim(t Tensor, dim int, splits ...split) iter.Seq[*ggml.Tensor] {
shape := slices.Clone(t.Shape())
shape[dim] = cmp.Or(uint64(split.dim), shape[dim]/uint64(len(splits)))
slice := slices.Repeat([]tensor.Slice{nil}, len(shape))
slice[dim] = tensor.S(offset, offset+int(shape[dim]))
offset += int(shape[dim])
slice := split.slices
if len(slice) == 0 {
slice = slices.Repeat([]tensor.Slice{nil}, len(shape))
slice[dim] = tensor.S(offset, offset+int(shape[dim]))
offset += int(shape[dim])
}
t.SetRepacker(func(_ string, data []float32, shape []uint64) ([]float32, error) {
dims := make([]int, len(shape))
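
When a `split` carries no explicit `slices`, `splitDim` derives each range from `dim`: a non-zero `dim` is used as the size, otherwise the axis is divided evenly across the splits. The sketch below reproduces only that offset arithmetic so it can be checked by hand. `splitRanges` is a hypothetical helper, it works on plain integers and not on tensors, and it assumes Go 1.22 or later for `cmp.Or`.

```go
package main

import (
	"cmp"
	"fmt"
)

// splitRanges returns the [start, end) range each split would take along an
// axis of length total. A zero entry in dims means "take an even share".
func splitRanges(total int, dims ...int) [][2]int {
	out := make([][2]int, 0, len(dims))
	offset := 0
	for _, d := range dims {
		// cmp.Or returns the first non-zero argument.
		size := cmp.Or(d, total/len(dims))
		out = append(out, [2]int{offset, offset + size})
		offset += size
	}
	return out
}

func main() {
	fmt.Println(splitRanges(4, 0, 0))    // prints [[0 2] [2 4]]
	fmt.Println(splitRanges(4, 2, 1, 1)) // prints [[0 2] [2 3] [3 4]]
}
```

The second call corresponds to the "uneven three way split" test case, where an axis of length 4 is divided into pieces of 2, 1 and 1.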


@@ -72,236 +72,787 @@ func mul(shape []uint64) int {
}
func TestSplitDim(t *testing.T) {
r := fakeTensor{
name: "a.b",
shape: []uint64{3, 4},
data: []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},
}
t.Run("no split", func(t *testing.T) {
for tt := range splitDim(&r, 0, split{Replacer: strings.NewReplacer("a", "x")}) {
if tt.Name != "x.b" {
t.Fatalf("expected name 'x', got '%s'", tt.Name)
}
if !slices.Equal(tt.Shape, []uint64{3, 4}) {
t.Fatalf("expected shape [3, 4], got %v", tt.Shape)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if !slices.Equal(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}) {
t.Fatalf("expected data [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], got %v", f32s)
}
t.Run("2d", func(t *testing.T) {
r := fakeTensor{
name: "a.b",
shape: []uint64{3, 4},
data: []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},
}
t.Run("no split", func(t *testing.T) {
for tt := range splitDim(&r, 0, split{Replacer: strings.NewReplacer("a", "x")}) {
if tt.Name != "x.b" {
t.Fatalf("expected name 'x.b', got '%s'", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 4}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("even split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x")},
split{Replacer: strings.NewReplacer("b", "y")},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 4, 5, 8, 9}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{2, 3, 6, 7, 10, 11}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("uneven split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 0,
split{Replacer: strings.NewReplacer("a", "x"), dim: 2},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{2, 4}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{8, 9, 10, 11}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("three way split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 0,
split{Replacer: strings.NewReplacer("a", "x"), dim: 1},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
split{Replacer: strings.NewReplacer("b", "z"), dim: 1},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{4, 5, 6, 7}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.z" {
t.Fatal("expected name 'a.z', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{8, 9, 10, 11}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("uneven three way split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x"), dim: 2},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
split{Replacer: strings.NewReplacer("b", "z"), dim: 1},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 4, 5, 8, 9}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 1}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{2, 6, 10}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.z" {
t.Fatal("expected name 'a.z', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 1}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{3, 7, 11}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("split with transpose", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x")},
split{Replacer: strings.NewReplacer("b", "y"), fn: func(tt tensor.Tensor) (tensor.Tensor, error) {
return tensor.Transpose(tt, 1, 0)
}},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 4, 5, 8, 9}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{2, 6, 10, 3, 7, 11}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
})
t.Run("even split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x")},
split{Replacer: strings.NewReplacer("b", "y")},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if !slices.Equal(tt.Shape, []uint64{3, 2}) {
t.Fatal("expected shape [3, 2], got", tt.Shape)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if !slices.Equal(f32s, []float32{0, 1, 4, 5, 8, 9}) {
t.Fatal("expected data [0, 1, 4, 5, 8, 9], got", f32s)
}
t.Run("3d", func(t *testing.T) {
r := fakeTensor{
name: "a.b",
shape: []uint64{3, 4, 2},
data: []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23},
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
t.Run("no split", func(t *testing.T) {
for tt := range splitDim(&r, 0, split{Replacer: strings.NewReplacer("a", "x")}) {
if tt.Name != "x.b" {
t.Fatalf("expected name 'x.b', got '%s'", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 4, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("even split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x")},
split{Replacer: strings.NewReplacer("b", "y")},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("uneven split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 0,
split{Replacer: strings.NewReplacer("a", "x"), dim: 2},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{2, 4, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
if !slices.Equal(tt.Shape, []uint64{3, 2}) {
t.Fatal("expected shape [3, 2], got", tt.Shape)
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{16, 17, 18, 19, 20, 21, 22, 23}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("three way split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 0,
split{Replacer: strings.NewReplacer("a", "x"), dim: 1},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
split{Replacer: strings.NewReplacer("b", "z"), dim: 1},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{8, 9, 10, 11, 12, 13, 14, 15}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.z" {
t.Fatal("expected name 'a.z', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{1, 4, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{16, 17, 18, 19, 20, 21, 22, 23}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
})
t.Run("uneven three way split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x"), dim: 2},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
split{Replacer: strings.NewReplacer("b", "z"), dim: 1},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 2, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
if !slices.Equal(f32s, []float32{2, 3, 6, 7, 10, 11}) {
t.Fatal("expected data [2, 3, 6, 7, 10, 11], got", f32s)
}
}
})
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
t.Run("uneven split", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 0,
split{Replacer: strings.NewReplacer("a", "x"), dim: 2},
split{Replacer: strings.NewReplacer("b", "y"), dim: 1},
))
defer stop()
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
if diff := cmp.Diff(tt.Shape, []uint64{3, 1, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(f32s, []float32{4, 5, 12, 13, 20, 21}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if !slices.Equal(tt.Shape, []uint64{2, 4}) {
t.Fatal("expected shape [2, 4], got", tt.Shape)
}
if tt.Name != "a.z" {
t.Fatal("expected name 'a.z', got", tt.Name)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(tt.Shape, []uint64{3, 1, 2}); diff != "" {
t.Errorf("unexpected shape (-want +got):\n%s", diff)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
if !slices.Equal(f32s, []float32{0, 1, 2, 3, 4, 5, 6, 7}) {
t.Fatal("expected data [0, 1, 2, 3, 4, 5, 6, 7], got", f32s)
}
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
if diff := cmp.Diff(f32s, []float32{6, 7, 14, 15, 22, 23}); diff != "" {
t.Errorf("unexpected data (-want +got):\n%s", diff)
}
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if !slices.Equal(tt.Shape, []uint64{1, 4}) {
t.Fatal("expected shape [1, 4], got", tt.Shape)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if !slices.Equal(f32s, []float32{8, 9, 10, 11}) {
t.Fatal("expected data [8, 9, 10, 11], got", f32s)
}
}
})
t.Run("split with transpose", func(t *testing.T) {
next, stop := iter.Pull(splitDim(&r, 1,
split{Replacer: strings.NewReplacer("a", "x")},
split{Replacer: strings.NewReplacer("b", "y"), fn: func(tt tensor.Tensor) (tensor.Tensor, error) {
return tensor.Transpose(tt, 1, 0)
}},
))
defer stop()
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "x.b" {
t.Fatal("expected name 'x.b', got", tt.Name)
}
if !slices.Equal(tt.Shape, []uint64{3, 2}) {
t.Fatal("expected shape [3, 2], got", tt.Shape)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if !slices.Equal(f32s, []float32{0, 1, 4, 5, 8, 9}) {
t.Fatal("expected data [0, 1, 4, 5, 8, 9], got", f32s)
}
}
{
tt, ok := next()
if !ok {
t.Fatal("expected at least one split")
}
if tt.Name != "a.y" {
t.Fatal("expected name 'a.y', got", tt.Name)
}
if !slices.Equal(tt.Shape, []uint64{3, 2}) {
t.Fatal("expected shape [3, 2], got", tt.Shape)
}
var b bytes.Buffer
if _, err := tt.WriteTo(&b); err != nil {
t.Fatal(err)
}
f32s := make([]float32, mul(tt.Shape))
if err := binary.Read(&b, binary.LittleEndian, &f32s); err != nil {
t.Fatal(err)
}
if !slices.Equal(f32s, []float32{2, 6, 10, 3, 7, 11}) {
t.Fatal("expected data [2, 6, 10, 3, 7, 11], got", f32s)
}
}
})
})
}


@@ -1,83 +0,0 @@
//go:build linux || windows
package discover
import (
"errors"
"log/slog"
"os"
"path/filepath"
"runtime"
"strings"
)
// Determine if the given ROCm lib directory is usable by checking for existence of some glob patterns
func rocmLibUsable(libDir string) bool {
slog.Debug("evaluating potential rocm lib dir " + libDir)
for _, g := range ROCmLibGlobs {
res, _ := filepath.Glob(filepath.Join(libDir, g))
if len(res) == 0 {
return false
}
}
return true
}
func GetSupportedGFX(libDir string) ([]string, error) {
var ret []string
files, err := filepath.Glob(filepath.Join(libDir, "rocblas", "library", "TensileLibrary_lazy_gfx*.dat"))
if err != nil {
return nil, err
}
for _, file := range files {
ret = append(ret, strings.TrimSuffix(strings.TrimPrefix(filepath.Base(file), "TensileLibrary_lazy_"), ".dat"))
}
return ret, nil
}
func commonAMDValidateLibDir() (string, error) {
// Favor our bundled version
// Installer payload location if we're running the installed binary
rocmTargetDir := filepath.Join(LibOllamaPath, "rocm")
if rocmLibUsable(rocmTargetDir) {
slog.Debug("detected ROCM next to ollama executable " + rocmTargetDir)
return rocmTargetDir, nil
}
// Prefer explicit HIP env var
hipPath := os.Getenv("HIP_PATH")
if hipPath != "" {
hipLibDir := filepath.Join(hipPath, "bin")
if rocmLibUsable(hipLibDir) {
slog.Debug("detected ROCM via HIP_PATH=" + hipPath)
return hipLibDir, nil
}
}
// Scan the LD_LIBRARY_PATH or PATH
pathEnv := "LD_LIBRARY_PATH"
if runtime.GOOS == "windows" {
pathEnv = "PATH"
}
paths := os.Getenv(pathEnv)
for _, path := range filepath.SplitList(paths) {
d, err := filepath.Abs(path)
if err != nil {
continue
}
if rocmLibUsable(d) {
return d, nil
}
}
// Well known location(s)
for _, path := range RocmStandardLocations {
if rocmLibUsable(path) {
return path, nil
}
}
return "", errors.New("no suitable rocm found, falling back to CPU")
}
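
The file-name trimming in GetSupportedGFX above can be tried in isolation. The following is a minimal sketch, not the project's code: the helper name `gfxFromTensileFile` and the example paths are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// gfxFromTensileFile applies the same trimming as GetSupportedGFX: it keeps
// only the gfx target embedded in a TensileLibrary_lazy_<gfx>.dat file name.
// The helper name is hypothetical.
func gfxFromTensileFile(path string) string {
	base := filepath.Base(path)
	return strings.TrimSuffix(strings.TrimPrefix(base, "TensileLibrary_lazy_"), ".dat")
}

func main() {
	// Hypothetical paths, for illustration only.
	for _, p := range []string{
		"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat",
		"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat",
	} {
		fmt.Println(gfxFromTensileFile(p))
	}
}
```

The resulting strings are what the discovery code later compares against each GPU's Compute value.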


@@ -1,147 +0,0 @@
package discover
import (
"errors"
"fmt"
"log/slog"
"syscall"
"unsafe"
"golang.org/x/sys/windows"
)
const (
hipSuccess = 0
hipErrorNoDevice = 100
)
type hipDevicePropMinimal struct {
Name [256]byte
unused1 [140]byte
GcnArchName [256]byte // gfx####
iGPU int // Doesn't seem to actually report correctly
unused2 [128]byte
}
// Wrap the amdhip64.dll library for GPU discovery
type HipLib struct {
dll windows.Handle
hipGetDeviceCount uintptr
hipGetDeviceProperties uintptr
hipMemGetInfo uintptr
hipSetDevice uintptr
hipDriverGetVersion uintptr
}
func NewHipLib() (*HipLib, error) {
// At runtime we depend on v6, so discover GPUs with the same library for a consistent set of GPUs
h, err := windows.LoadLibrary("amdhip64_6.dll")
if err != nil {
return nil, fmt.Errorf("unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: %w", err)
}
hl := &HipLib{}
hl.dll = h
hl.hipGetDeviceCount, err = windows.GetProcAddress(hl.dll, "hipGetDeviceCount")
if err != nil {
return nil, err
}
hl.hipGetDeviceProperties, err = windows.GetProcAddress(hl.dll, "hipGetDeviceProperties")
if err != nil {
return nil, err
}
hl.hipMemGetInfo, err = windows.GetProcAddress(hl.dll, "hipMemGetInfo")
if err != nil {
return nil, err
}
hl.hipSetDevice, err = windows.GetProcAddress(hl.dll, "hipSetDevice")
if err != nil {
return nil, err
}
hl.hipDriverGetVersion, err = windows.GetProcAddress(hl.dll, "hipDriverGetVersion")
if err != nil {
return nil, err
}
return hl, nil
}
// The hip library only evaluates the ROCR_VISIBLE_DEVICES variable at startup
// so we have to unload/reset the library after we do our initial discovery
// to make sure our updates to that variable are processed by llama.cpp
func (hl *HipLib) Release() {
err := windows.FreeLibrary(hl.dll)
if err != nil {
slog.Warn("failed to unload amdhip64.dll", "error", err)
}
hl.dll = 0
}
func (hl *HipLib) AMDDriverVersion() (driverMajor, driverMinor int, err error) {
if hl.dll == 0 {
return 0, 0, errors.New("dll has been unloaded")
}
var version int
status, _, err := syscall.SyscallN(hl.hipDriverGetVersion, uintptr(unsafe.Pointer(&version)))
if status != hipSuccess {
return 0, 0, fmt.Errorf("failed call to hipDriverGetVersion: %d %s", status, err)
}
slog.Debug("hipDriverGetVersion", "version", version)
driverMajor = version / 10000000
driverMinor = (version - (driverMajor * 10000000)) / 100000
return driverMajor, driverMinor, nil
}
func (hl *HipLib) HipGetDeviceCount() int {
if hl.dll == 0 {
slog.Error("dll has been unloaded")
return 0
}
var count int
status, _, err := syscall.SyscallN(hl.hipGetDeviceCount, uintptr(unsafe.Pointer(&count)))
if status == hipErrorNoDevice {
slog.Info("AMD ROCm reports no devices found")
return 0
}
if status != hipSuccess {
slog.Warn("failed call to hipGetDeviceCount", "status", status, "error", err)
}
return count
}
func (hl *HipLib) HipSetDevice(device int) error {
if hl.dll == 0 {
return errors.New("dll has been unloaded")
}
status, _, err := syscall.SyscallN(hl.hipSetDevice, uintptr(device))
if status != hipSuccess {
return fmt.Errorf("failed call to hipSetDevice: %d %s", status, err)
}
return nil
}
func (hl *HipLib) HipGetDeviceProperties(device int) (*hipDevicePropMinimal, error) {
if hl.dll == 0 {
return nil, errors.New("dll has been unloaded")
}
var props hipDevicePropMinimal
status, _, err := syscall.SyscallN(hl.hipGetDeviceProperties, uintptr(unsafe.Pointer(&props)), uintptr(device))
if status != hipSuccess {
return nil, fmt.Errorf("failed call to hipGetDeviceProperties: %d %s", status, err)
}
return &props, nil
}
// free, total, err
func (hl *HipLib) HipMemGetInfo() (uint64, uint64, error) {
if hl.dll == 0 {
return 0, 0, errors.New("dll has been unloaded")
}
var totalMemory uint64
var freeMemory uint64
status, _, err := syscall.SyscallN(hl.hipMemGetInfo, uintptr(unsafe.Pointer(&freeMemory)), uintptr(unsafe.Pointer(&totalMemory)))
if status != hipSuccess {
return 0, 0, fmt.Errorf("failed call to hipMemGetInfo: %d %s", status, err)
}
return freeMemory, totalMemory, nil
}
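
The arithmetic in AMDDriverVersion above splits one packed integer into major and minor parts. A minimal sketch of that decoding follows; the function name `decodeHipVersion` and the input value are hypothetical, and the sketch makes no claim about what values a real driver reports.

```go
package main

import "fmt"

// decodeHipVersion repeats the arithmetic from AMDDriverVersion: the major
// number is everything above the seven lowest decimal digits, and the minor
// number is the next two decimal digits below that.
// The function name is hypothetical.
func decodeHipVersion(version int) (major, minor int) {
	major = version / 10000000
	minor = (version - major*10000000) / 100000
	return major, minor
}

func main() {
	// 61200000 is an illustrative value, not a captured driver response.
	major, minor := decodeHipVersion(61200000)
	fmt.Println(major, minor) // prints "6 12"
}
```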


@@ -1,538 +0,0 @@
package discover
import (
"bufio"
"errors"
"fmt"
"io"
"io/fs"
"log/slog"
"os"
"path/filepath"
"regexp"
"slices"
"sort"
"strconv"
"strings"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
)
// Discovery logic for AMD/ROCm GPUs
const (
DriverVersionFile = "/sys/module/amdgpu/version"
AMDNodesSysfsDir = "/sys/class/kfd/kfd/topology/nodes/"
GPUPropertiesFileGlob = AMDNodesSysfsDir + "*/properties"
// Prefix with the node dir
GPUTotalMemoryFileGlob = "mem_banks/*/properties" // size_in_bytes line
// Direct Rendering Manager sysfs location
DRMDeviceDirGlob = "/sys/class/drm/card*/device"
DRMTotalMemoryFile = "mem_info_vram_total"
DRMUsedMemoryFile = "mem_info_vram_used"
// In hex; properties file is in decimal
DRMUniqueIDFile = "unique_id"
DRMVendorFile = "vendor"
DRMDeviceFile = "device"
)
var (
// Used to validate if the given ROCm lib is usable
ROCmLibGlobs = []string{"libhipblas.so.2*", "rocblas"} // TODO - probably include more coverage of files here...
RocmStandardLocations = []string{"/opt/rocm/lib", "/usr/lib64"}
)
// Gather GPU information from the amdgpu driver if any supported GPUs are detected
// Only called once during bootstrap
func AMDGetGPUInfo() ([]RocmGPUInfo, error) {
resp := []RocmGPUInfo{}
if !AMDDetected() {
return resp, fmt.Errorf("AMD GPUs not detected")
}
// Opportunistic logging of driver version to aid in troubleshooting
driverMajor, driverMinor, err := AMDDriverVersion()
if err != nil {
// TODO - if we see users crash and burn with the upstreamed kernel this can be adjusted to hard-fail rocm support and fallback to CPU
slog.Warn("ollama recommends running the AMD driver from https://www.amd.com/en/support/linux-drivers", "error", err)
}
// Determine if the user has already pre-selected which GPUs to look at, then ignore the others
var visibleDevices []string
hipVD := envconfig.HipVisibleDevices() // zero based index only
rocrVD := envconfig.RocrVisibleDevices() // zero based index or UUID
gpuDO := envconfig.GpuDeviceOrdinal() // zero based index
switch {
case rocrVD != "":
visibleDevices = strings.Split(rocrVD, ",")
case hipVD != "":
visibleDevices = strings.Split(hipVD, ",")
case gpuDO != "":
visibleDevices = strings.Split(gpuDO, ",")
}
gfxOverride := envconfig.HsaOverrideGfxVersion()
var supported []string
var libDir string
// The amdgpu driver always exposes the host CPU(s) first, but we have to skip them and subtract
// from the other IDs to get alignment with the HIP libraries expectations (zero is the first GPU, not the CPU)
matches, _ := filepath.Glob(GPUPropertiesFileGlob)
sort.Slice(matches, func(i, j int) bool {
// /sys/class/kfd/kfd/topology/nodes/<number>/properties
a, err := strconv.ParseInt(filepath.Base(filepath.Dir(matches[i])), 10, 64)
if err != nil {
slog.Debug("parse err", "error", err, "match", matches[i])
return false
}
b, err := strconv.ParseInt(filepath.Base(filepath.Dir(matches[j])), 10, 64)
if err != nil {
slog.Debug("parse err", "error", err, "match", matches[j])
return false
}
return a < b
})
gpuCount := 0
for _, match := range matches {
slog.Debug("evaluating amdgpu node " + match)
fp, err := os.Open(match)
if err != nil {
slog.Debug("failed to open sysfs node", "file", match, "error", err)
continue
}
defer fp.Close()
scanner := bufio.NewScanner(fp)
isCPU := false
var major, minor, patch uint64
var vendor, device, uniqueID uint64
for scanner.Scan() {
line := strings.TrimSpace(scanner.Text())
// Note: we could also use "cpu_cores_count X" where X is greater than zero to detect CPUs
if strings.HasPrefix(line, "gfx_target_version") {
ver := strings.Fields(line)
// Detect CPUs
if len(ver) == 2 && ver[1] == "0" {
slog.Debug("detected CPU " + match)
isCPU = true
break
}
if len(ver) != 2 || len(ver[1]) < 5 {
slog.Warn("malformed "+match, "gfx_target_version", line)
// If this winds up being a CPU, our offsets may be wrong
continue
}
l := len(ver[1])
var err1, err2, err3 error
patch, err1 = strconv.ParseUint(ver[1][l-2:l], 10, 32)
minor, err2 = strconv.ParseUint(ver[1][l-4:l-2], 10, 32)
major, err3 = strconv.ParseUint(ver[1][:l-4], 10, 32)
if err1 != nil || err2 != nil || err3 != nil {
slog.Debug("malformed int " + line)
continue
}
} else if strings.HasPrefix(line, "vendor_id") {
ver := strings.Fields(line)
if len(ver) != 2 {
slog.Debug("malformed", "vendor_id", line)
continue
}
vendor, err = strconv.ParseUint(ver[1], 10, 64)
if err != nil {
slog.Debug("malformed", "vendor_id", line, "error", err)
}
} else if strings.HasPrefix(line, "device_id") {
ver := strings.Fields(line)
if len(ver) != 2 {
slog.Debug("malformed", "device_id", line)
continue
}
device, err = strconv.ParseUint(ver[1], 10, 64)
if err != nil {
slog.Debug("malformed", "device_id", line, "error", err)
}
} else if strings.HasPrefix(line, "unique_id") {
ver := strings.Fields(line)
if len(ver) != 2 {
slog.Debug("malformed", "unique_id", line)
continue
}
uniqueID, err = strconv.ParseUint(ver[1], 10, 64)
if err != nil {
slog.Debug("malformed", "unique_id", line, "error", err)
}
}
// TODO - any other properties we want to extract and record?
// vendor_id + device_id -> pci lookup for "Name"
// Other metrics that may help us understand relative performance between multiple GPUs
}
// Note: while ./mem_banks/*/used_memory exists, it doesn't appear to take other VRAM consumers
// into consideration, so we instead map the device over to the DRM driver sysfs nodes which
// do reliably report VRAM usage.
if isCPU {
continue
}
// Skip over any GPUs that are masked
if major == 0 && minor == 0 && patch == 0 {
slog.Debug("skipping gpu with gfx000")
continue
}
// Keep track of numeric IDs based on valid GPUs
gpuID := gpuCount
gpuCount += 1
// Look up the memory for the current node
totalMemory := uint64(0)
usedMemory := uint64(0)
var usedFile string
mapping := []struct {
id uint64
filename string
}{
{vendor, DRMVendorFile},
{device, DRMDeviceFile},
{uniqueID, DRMUniqueIDFile}, // Not all devices will report this
}
slog.Debug("mapping amdgpu to drm sysfs nodes", "amdgpu", match, "vendor", vendor, "device", device, "unique_id", uniqueID)
// Map over to DRM location to find the total/free memory
drmMatches, _ := filepath.Glob(DRMDeviceDirGlob)
for _, devDir := range drmMatches {
matched := true
for _, m := range mapping {
if m.id == 0 {
// Null ID means it didn't populate, so we can't use it to match
continue
}
filename := filepath.Join(devDir, m.filename)
buf, err := os.ReadFile(filename)
if err != nil {
slog.Debug("failed to read sysfs node", "file", filename, "error", err)
matched = false
break
}
// values here are in hex, strip off the lead 0x and parse so we can compare the numeric (decimal) values in amdgpu
cmp, err := strconv.ParseUint(strings.TrimPrefix(strings.TrimSpace(string(buf)), "0x"), 16, 64)
if err != nil {
slog.Debug("failed to parse sysfs node", "file", filename, "error", err)
matched = false
break
}
if cmp != m.id {
matched = false
break
}
}
if !matched {
continue
}
// Found the matching DRM directory
slog.Debug("matched", "amdgpu", match, "drm", devDir)
totalFile := filepath.Join(devDir, DRMTotalMemoryFile)
buf, err := os.ReadFile(totalFile)
if err != nil {
slog.Debug("failed to read sysfs node", "file", totalFile, "error", err)
break
}
totalMemory, err = strconv.ParseUint(strings.TrimSpace(string(buf)), 10, 64)
if err != nil {
slog.Debug("failed to parse sysfs node", "file", totalFile, "error", err)
break
}
usedFile = filepath.Join(devDir, DRMUsedMemoryFile)
usedMemory, err = getFreeMemory(usedFile)
if err != nil {
slog.Debug("failed to update used memory", "error", err)
}
break
}
var name string
// TODO - PCI ID lookup
if vendor > 0 && device > 0 {
name = fmt.Sprintf("%04x:%04x", vendor, device)
}
// Favor UUIDs if available to reduce possibility of getting the numeric IDs wrong
var ID string
if uniqueID != 0 {
ID = fmt.Sprintf("GPU-%016x", uniqueID)
} else {
ID = strconv.Itoa(gpuID)
}
gpuInfo := RocmGPUInfo{
GpuInfo: GpuInfo{
Library: "rocm",
memInfo: memInfo{
TotalMemory: totalMemory,
FreeMemory: (totalMemory - usedMemory),
},
ID: ID,
Name: name,
Compute: fmt.Sprintf("gfx%d%x%x", major, minor, patch),
MinimumMemory: rocmMinimumMemory,
DriverMajor: driverMajor,
DriverMinor: driverMinor,
},
usedFilepath: usedFile,
index: gpuID,
}
// iGPU detection, remove this check once we can support an iGPU variant of the rocm library
if totalMemory < IGPUMemLimit {
reason := "unsupported Radeon iGPU detected skipping"
slog.Info(reason, "id", gpuID, "total", format.HumanBytes2(totalMemory))
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: reason,
})
continue
}
minVer, err := strconv.Atoi(RocmComputeMajorMin)
if err != nil {
slog.Error("invalid RocmComputeMajorMin setting", "value", RocmComputeMajorMin, "error", err)
}
if int(major) < minVer {
reason := fmt.Sprintf("amdgpu too old gfx%d%x%x", major, minor, patch)
slog.Warn(reason, "gpu", gpuID)
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: reason,
})
continue
}
slog.Debug("amdgpu memory", "gpu", gpuID, "total", format.HumanBytes2(totalMemory))
slog.Debug("amdgpu memory", "gpu", gpuID, "available", format.HumanBytes2(totalMemory-usedMemory))
// If the user wants to filter to a subset of devices, filter out if we aren't a match
if len(visibleDevices) > 0 {
include := false
for _, visible := range visibleDevices {
if visible == gpuInfo.ID || visible == strconv.Itoa(gpuInfo.index) {
include = true
break
}
}
if !include {
reason := "filtering out device per user request"
slog.Info(reason, "id", gpuInfo.ID, "visible_devices", visibleDevices)
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: reason,
})
continue
}
}
// Final validation is gfx compatibility - load the library if we haven't already loaded it
// even if the user overrides, we still need to validate the library
if libDir == "" {
libDir, err = AMDValidateLibDir()
if err != nil {
err = fmt.Errorf("unable to verify rocm library: %w", err)
slog.Warn(err.Error())
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: err.Error(),
})
return nil, err
}
}
gpuInfo.DependencyPath = []string{libDir}
if gfxOverride == "" {
// Only load supported list once
if len(supported) == 0 {
supported, err = GetSupportedGFX(libDir)
if err != nil {
err = fmt.Errorf("failed to lookup supported GFX types: %w", err)
slog.Warn(err.Error())
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: err.Error(),
})
return nil, err
}
slog.Debug("rocm supported GPUs", "types", supported)
}
gfx := gpuInfo.Compute
if !slices.Contains[[]string, string](supported, gfx) {
reason := fmt.Sprintf("amdgpu is not supported (supported types:%s)", supported)
slog.Warn(reason, "gpu_type", gfx, "gpu", gpuInfo.ID, "library", libDir)
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: reason,
})
// TODO - consider discrete markdown just for ROCM troubleshooting?
slog.Warn("See https://github.com/ollama/ollama/blob/main/docs/gpu.md#overrides for HSA_OVERRIDE_GFX_VERSION usage")
continue
} else {
slog.Info("amdgpu is supported", "gpu", gpuInfo.ID, "gpu_type", gfx)
}
} else {
slog.Info("skipping rocm gfx compatibility check", "HSA_OVERRIDE_GFX_VERSION", gfxOverride)
}
// Check for env var workarounds
if name == "1002:687f" { // Vega RX 56
gpuInfo.EnvWorkarounds = append(gpuInfo.EnvWorkarounds, [2]string{"HSA_ENABLE_SDMA", "0"})
}
// The GPU has passed all the verification steps and is supported
resp = append(resp, gpuInfo)
}
if len(resp) == 0 {
err := fmt.Errorf("no compatible amdgpu devices detected")
slog.Info(err.Error())
return nil, err
}
if err := verifyKFDDriverAccess(); err != nil {
err = fmt.Errorf("amdgpu devices detected but permission problems block access: %w", err)
slog.Error(err.Error())
return nil, err
}
return resp, nil
}
// Quick check for AMD driver so we can skip amdgpu discovery if not present
func AMDDetected() bool {
// Some driver versions (older?) don't have a version file, so just lookup the parent dir
sysfsDir := filepath.Dir(DriverVersionFile)
_, err := os.Stat(sysfsDir)
if errors.Is(err, os.ErrNotExist) {
slog.Debug("amdgpu driver not detected " + sysfsDir)
return false
} else if err != nil {
slog.Debug("error looking up amd driver", "path", sysfsDir, "error", err)
return false
}
return true
}
// Prefer to use host installed ROCm, as long as it meets our minimum requirements
// failing that, tell the user how to download it on their own
func AMDValidateLibDir() (string, error) {
libDir, err := commonAMDValidateLibDir()
if err == nil {
return libDir, nil
}
// Well known ollama installer path
installedRocmDir := "/usr/share/ollama/lib/rocm"
if rocmLibUsable(installedRocmDir) {
return installedRocmDir, nil
}
// If we still haven't found a usable rocm, the user will have to install it on their own
slog.Warn("amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install")
return "", errors.New("no suitable rocm found, falling back to CPU")
}
func AMDDriverVersion() (driverMajor, driverMinor int, err error) {
_, err = os.Stat(DriverVersionFile)
if err != nil {
return 0, 0, fmt.Errorf("amdgpu version file missing: %s %w", DriverVersionFile, err)
}
fp, err := os.Open(DriverVersionFile)
if err != nil {
return 0, 0, err
}
defer fp.Close()
verString, err := io.ReadAll(fp)
if err != nil {
return 0, 0, err
}
pattern := `\A(\d+)\.(\d+).*`
regex := regexp.MustCompile(pattern)
match := regex.FindStringSubmatch(string(verString))
if len(match) < 2 {
return 0, 0, fmt.Errorf("malformed version string %s", string(verString))
}
driverMajor, err = strconv.Atoi(match[1])
if err != nil {
return 0, 0, err
}
driverMinor, err = strconv.Atoi(match[2])
if err != nil {
return 0, 0, err
}
return driverMajor, driverMinor, nil
}
func (gpus RocmGPUInfoList) RefreshFreeMemory() error {
if len(gpus) == 0 {
return nil
}
for i := range gpus {
usedMemory, err := getFreeMemory(gpus[i].usedFilepath)
if err != nil {
return err
}
slog.Debug("updating rocm free memory", "gpu", gpus[i].ID, "name", gpus[i].Name, "before", format.HumanBytes2(gpus[i].FreeMemory), "now", format.HumanBytes2(gpus[i].TotalMemory-usedMemory))
gpus[i].FreeMemory = gpus[i].TotalMemory - usedMemory
}
return nil
}
func getFreeMemory(usedFile string) (uint64, error) {
buf, err := os.ReadFile(usedFile)
if err != nil {
return 0, fmt.Errorf("failed to read sysfs node %s %w", usedFile, err)
}
usedMemory, err := strconv.ParseUint(strings.TrimSpace(string(buf)), 10, 64)
if err != nil {
slog.Debug("failed to parse sysfs node", "file", usedFile, "error", err)
return 0, fmt.Errorf("failed to parse sysfs node %s %w", usedFile, err)
}
return usedMemory, nil
}
func verifyKFDDriverAccess() error {
// Verify we have permissions - either running as root, or we have group access to the driver
fd, err := os.OpenFile("/dev/kfd", os.O_RDWR, 0o666)
if err != nil {
if errors.Is(err, fs.ErrPermission) {
return fmt.Errorf("permissions not set up properly. Either run ollama as root, or add your user account to the render group. %w", err)
} else if errors.Is(err, fs.ErrNotExist) {
// Container runtime failure?
return fmt.Errorf("kfd driver not loaded. If running in a container, remember to include '--device /dev/kfd --device /dev/dri'")
}
return fmt.Errorf("failed to check permission on /dev/kfd: %w", err)
}
fd.Close()
return nil
}
func rocmGetVisibleDevicesEnv(gpuInfo []GpuInfo) (string, string) {
ids := []string{}
for _, info := range gpuInfo {
if info.Library != "rocm" {
// TODO shouldn't happen if things are wired correctly...
slog.Debug("rocmGetVisibleDevicesEnv skipping over non-rocm device", "library", info.Library)
continue
}
ids = append(ids, info.ID)
}
// There are 3 potential env vars to use to select GPUs.
// ROCR_VISIBLE_DEVICES supports UUID or numeric so is our preferred on linux
// GPU_DEVICE_ORDINAL supports numeric IDs only
// HIP_VISIBLE_DEVICES supports numeric IDs only
return "ROCR_VISIBLE_DEVICES", strings.Join(ids, ",")
}
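
The Linux discovery code above turns the decimal `gfx_target_version` value from sysfs into a `gfx…` name: the last two digits are the patch, the two before them the minor, and the remainder the major, with minor and patch then printed in hex. A minimal sketch of that conversion, assuming a hypothetical helper name `gfxName` and illustrative input strings:

```go
package main

import (
	"fmt"
	"strconv"
)

// gfxName follows the slicing used in AMDGetGPUInfo: patch and minor are two
// decimal digits each, major is whatever precedes them, and the result is
// formatted with "gfx%d%x%x". The helper name is hypothetical.
func gfxName(targetVersion string) (string, error) {
	l := len(targetVersion)
	if l < 5 {
		return "", fmt.Errorf("malformed gfx_target_version %q", targetVersion)
	}
	patch, err := strconv.ParseUint(targetVersion[l-2:], 10, 32)
	if err != nil {
		return "", err
	}
	minor, err := strconv.ParseUint(targetVersion[l-4:l-2], 10, 32)
	if err != nil {
		return "", err
	}
	major, err := strconv.ParseUint(targetVersion[:l-4], 10, 32)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("gfx%d%x%x", major, minor, patch), nil
}

func main() {
	// Illustrative inputs; the decimal patch 12 becomes hex "c".
	for _, v := range []string{"90012", "100300", "110000"} {
		name, err := gfxName(v)
		fmt.Println(v, name, err)
	}
}
```

Because minor and patch are printed in hex, a decimal patch of 10 or more yields a letter, which is how names such as `gfx90a` arise.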


@@ -1,218 +0,0 @@
package discover
import (
"bytes"
"errors"
"fmt"
"log/slog"
"path/filepath"
"slices"
"strconv"
"strings"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
)
const (
// TODO We're looking for this exact name to detect iGPUs since hipGetDeviceProperties never reports integrated==true
iGPUName = "AMD Radeon(TM) Graphics"
)
var (
// Used to validate if the given ROCm lib is usable
ROCmLibGlobs = []string{"hipblas.dll", "rocblas"} // This is not sufficient to discern v5 vs v6
RocmStandardLocations = []string{"C:\\Program Files\\AMD\\ROCm\\6.1\\bin"} // TODO glob?
)
// Only called once during bootstrap
func AMDGetGPUInfo() ([]RocmGPUInfo, error) {
resp := []RocmGPUInfo{}
hl, err := NewHipLib()
if err != nil {
slog.Debug(err.Error())
return nil, err
}
defer hl.Release()
driverMajor, driverMinor, err := hl.AMDDriverVersion()
if err != nil {
// For now this is benign, but we may eventually need to fail compatibility checks
slog.Debug("error looking up amd driver version", "error", err)
}
// Note: the HIP library automatically handles subsetting to any *_VISIBLE_DEVICES the user specified
count := hl.HipGetDeviceCount()
if count == 0 {
err := fmt.Errorf("no compatible amdgpu devices detected")
slog.Info(err.Error())
return nil, err
}
libDir, err := AMDValidateLibDir()
if err != nil {
err = fmt.Errorf("unable to verify rocm library: %w", err)
slog.Warn(err.Error())
return nil, err
}
var supported []string
gfxOverride := envconfig.HsaOverrideGfxVersion()
if gfxOverride == "" {
supported, err = GetSupportedGFX(libDir)
if err != nil {
err = fmt.Errorf("failed to lookup supported GFX types: %w", err)
slog.Warn(err.Error())
return nil, err
}
} else {
slog.Info("skipping rocm gfx compatibility check", "HSA_OVERRIDE_GFX_VERSION", gfxOverride)
}
slog.Debug("detected hip devices", "count", count)
// TODO how to determine the underlying device ID when visible devices is causing this to subset?
for i := range count {
err = hl.HipSetDevice(i)
if err != nil {
slog.Warn("set device", "id", i, "error", err)
continue
}
props, err := hl.HipGetDeviceProperties(i)
if err != nil {
slog.Warn("get properties", "id", i, "error", err)
continue
}
n := bytes.IndexByte(props.Name[:], 0)
name := string(props.Name[:n])
// TODO is UUID actually populated on windows?
// Can luid be used on windows for setting visible devices (and is it actually set?)
n = bytes.IndexByte(props.GcnArchName[:], 0)
gfx := string(props.GcnArchName[:n])
slog.Debug("hip device", "id", i, "name", name, "gfx", gfx)
// slog.Info(fmt.Sprintf("[%d] Integrated: %d", i, props.iGPU)) // DOESN'T REPORT CORRECTLY! Always 0
// TODO Why isn't props.iGPU accurate!?
freeMemory, totalMemory, err := hl.HipMemGetInfo()
if err != nil {
slog.Warn("get mem info", "id", i, "error", err)
continue
}
gpuInfo := RocmGPUInfo{
GpuInfo: GpuInfo{
Library: "rocm",
memInfo: memInfo{
TotalMemory: totalMemory,
FreeMemory: freeMemory,
},
// Free memory reporting on Windows is not reliable until we bump to ROCm v6.2
UnreliableFreeMemory: true,
ID: strconv.Itoa(i), // TODO this is probably wrong if we specify visible devices
DependencyPath: []string{libDir},
MinimumMemory: rocmMinimumMemory,
Name: name,
Compute: gfx,
DriverMajor: driverMajor,
DriverMinor: driverMinor,
},
index: i,
}
// iGPU detection, remove this check once we can support an iGPU variant of the rocm library
if strings.EqualFold(name, iGPUName) || totalMemory < IGPUMemLimit {
reason := "unsupported Radeon iGPU detected skipping"
slog.Info(reason, "id", gpuInfo.ID, "total", format.HumanBytes2(totalMemory))
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: reason,
})
continue
}
// Strip off Target Features when comparing
if !slices.Contains[[]string, string](supported, strings.Split(gfx, ":")[0]) {
reason := fmt.Sprintf("amdgpu is not supported (supported types:%s)", supported)
slog.Warn(reason, "gpu_type", gfx, "gpu", gpuInfo.ID, "library", libDir)
unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
Reason: reason,
})
// HSA_OVERRIDE_GFX_VERSION not supported on windows
continue
} else {
slog.Debug("amdgpu is supported", "gpu", i, "gpu_type", gfx)
}
slog.Debug("amdgpu memory", "gpu", i, "total", format.HumanBytes2(totalMemory))
slog.Debug("amdgpu memory", "gpu", i, "available", format.HumanBytes2(freeMemory))
resp = append(resp, gpuInfo)
}
return resp, nil
}
func AMDValidateLibDir() (string, error) {
libDir, err := commonAMDValidateLibDir()
if err == nil {
return libDir, nil
}
// Installer payload (if we're running from some other location)
rocmTargetDir := filepath.Join(LibOllamaPath, "rocm")
if rocmLibUsable(rocmTargetDir) {
slog.Debug("detected ollama installed ROCm at " + rocmTargetDir)
return rocmTargetDir, nil
}
// Should not happen on windows since we include it in the installer, but stand-alone binary might hit this
slog.Warn("amdgpu detected, but no compatible rocm library found. Please install ROCm")
return "", errors.New("no suitable rocm found, falling back to CPU")
}
func (gpus RocmGPUInfoList) RefreshFreeMemory() error {
if len(gpus) == 0 {
return nil
}
hl, err := NewHipLib()
if err != nil {
slog.Debug(err.Error())
return err
}
defer hl.Release()
for i := range gpus {
err := hl.HipSetDevice(gpus[i].index)
if err != nil {
return err
}
freeMemory, _, err := hl.HipMemGetInfo()
if err != nil {
slog.Warn("get mem info", "id", i, "error", err)
continue
}
slog.Debug("updating rocm free memory", "gpu", gpus[i].ID, "name", gpus[i].Name, "before", format.HumanBytes2(gpus[i].FreeMemory), "now", format.HumanBytes2(freeMemory))
gpus[i].FreeMemory = freeMemory
}
return nil
}
func rocmGetVisibleDevicesEnv(gpuInfo []GpuInfo) (string, string) {
ids := []string{}
for _, info := range gpuInfo {
if info.Library != "rocm" {
// TODO shouldn't happen if things are wired correctly...
slog.Debug("rocmGetVisibleDevicesEnv skipping over non-rocm device", "library", info.Library)
continue
}
ids = append(ids, info.ID)
}
// There are 3 potential env vars to use to select GPUs.
// ROCR_VISIBLE_DEVICES supports UUID or numeric but does not work on Windows
// HIP_VISIBLE_DEVICES supports numeric IDs only
// GPU_DEVICE_ORDINAL supports numeric IDs only
return "HIP_VISIBLE_DEVICES", strings.Join(ids, ",")
}
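
On Windows, the compatibility check above strips any target-feature suffix from the architecture name before looking it up in the supported list. A minimal sketch of that comparison, assuming a hypothetical helper `isSupportedGFX` and illustrative architecture strings:

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// isSupportedGFX keeps only the part of the architecture name before the
// first ":" and checks it against the supported list, as the Windows
// discovery loop does. The helper name is hypothetical.
func isSupportedGFX(supported []string, gcnArchName string) bool {
	base := strings.Split(gcnArchName, ":")[0]
	return slices.Contains(supported, base)
}

func main() {
	// Illustrative values, not taken from a real library listing.
	supported := []string{"gfx1030", "gfx1100", "gfx90a"}
	fmt.Println(isSupportedGFX(supported, "gfx90a:sramecc+:xnack-")) // prints "true"
	fmt.Println(isSupportedGFX(supported, "gfx803"))                 // prints "false"
}
```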


@@ -1,24 +0,0 @@
package discover
import (
"os"
"path/filepath"
"runtime"
"strings"
)
func IsNUMA() bool {
if runtime.GOOS != "linux" {
// numa support in llama.cpp is linux only
return false
}
ids := map[string]any{}
packageIds, _ := filepath.Glob("/sys/devices/system/cpu/cpu*/topology/physical_package_id")
for _, packageId := range packageIds {
id, err := os.ReadFile(packageId)
if err == nil {
ids[strings.TrimSpace(string(id))] = struct{}{}
}
}
return len(ids) > 1
}
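
IsNUMA above treats the machine as NUMA when the per-CPU `physical_package_id` files contain more than one distinct value. The counting step can be sketched without touching sysfs; the helper name `distinctPackages` and the sample file contents are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// distinctPackages counts unique package IDs after trimming whitespace,
// using a map as a set in the same way IsNUMA does.
// The helper name is hypothetical.
func distinctPackages(contents []string) int {
	ids := map[string]struct{}{}
	for _, c := range contents {
		ids[strings.TrimSpace(c)] = struct{}{}
	}
	return len(ids)
}

func main() {
	// Hypothetical file contents, one entry per CPU.
	singleSocket := []string{"0\n", "0\n", "0\n", "0\n"}
	dualSocket := []string{"0\n", "0\n", "1\n", "1\n"}
	fmt.Println(distinctPackages(singleSocket) > 1) // prints "false"
	fmt.Println(distinctPackages(dualSocket) > 1)   // prints "true"
}
```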


@@ -4,7 +4,9 @@ import (
"bufio"
"fmt"
"io"
"log/slog"
"os"
"path/filepath"
"reflect"
"regexp"
"sort"
@@ -13,47 +15,6 @@ import (
"github.com/ollama/ollama/format"
)
var CudartGlobs = []string{
"/usr/local/cuda/lib64/libcudart.so*",
"/usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so*",
"/usr/lib/x86_64-linux-gnu/libcudart.so*",
"/usr/lib/wsl/lib/libcudart.so*",
"/usr/lib/wsl/drivers/*/libcudart.so*",
"/opt/cuda/lib64/libcudart.so*",
"/usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so*",
"/usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so*",
"/usr/lib/aarch64-linux-gnu/libcudart.so*",
"/usr/local/cuda/lib*/libcudart.so*",
"/usr/lib*/libcudart.so*",
"/usr/local/lib*/libcudart.so*",
}
var NvmlGlobs = []string{}
var NvcudaGlobs = []string{
"/usr/local/cuda*/targets/*/lib/libcuda.so*",
"/usr/lib/*-linux-gnu/nvidia/current/libcuda.so*",
"/usr/lib/*-linux-gnu/libcuda.so*",
"/usr/lib/wsl/lib/libcuda.so*",
"/usr/lib/wsl/drivers/*/libcuda.so*",
"/opt/cuda/lib*/libcuda.so*",
"/usr/local/cuda/lib*/libcuda.so*",
"/usr/lib*/libcuda.so*",
"/usr/local/lib*/libcuda.so*",
}
var OneapiGlobs = []string{
"/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so*",
"/usr/lib*/libze_intel_gpu.so*",
}
var (
CudartMgmtName = "libcudart.so*"
NvcudaMgmtName = "libcuda.so*"
NvmlMgmtName = "" // not currently wired on linux
OneapiMgmtName = "libze_intel_gpu.so*"
)
func GetCPUMem() (memInfo, error) {
var mem memInfo
var total, available, free, buffers, cached, freeSwap uint64
@@ -106,16 +67,17 @@ type linuxCpuInfo struct {
CoreID string `cpuinfo:"core id"`
}
func GetCPUDetails() ([]CPU, error) {
func GetCPUDetails() []CPU {
file, err := os.Open(CpuInfoFilename)
if err != nil {
return nil, err
slog.Warn("failed to get CPU details", "error", err)
return nil
}
defer file.Close()
return linuxCPUDetails(file)
}
func linuxCPUDetails(file io.Reader) ([]CPU, error) {
func linuxCPUDetails(file io.Reader) []CPU {
reColumns := regexp.MustCompile("\t+: ")
scanner := bufio.NewScanner(file)
cpuInfos := []linuxCpuInfo{}
@@ -194,5 +156,17 @@ func linuxCPUDetails(file io.Reader) ([]CPU, error) {
for _, k := range keys {
result = append(result, *socketByID[k])
}
return result, nil
return result
}
func IsNUMA() bool {
ids := map[string]any{}
packageIds, _ := filepath.Glob("/sys/devices/system/cpu/cpu*/topology/physical_package_id")
for _, packageId := range packageIds {
id, err := os.ReadFile(packageId)
if err == nil {
ids[strings.TrimSpace(string(id))] = struct{}{}
}
}
return len(ids) > 1
}
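The NUMA check above boils down to counting distinct physical package IDs. A minimal sketch of that counting step follows, with the sysfs reads replaced by in-memory strings so it runs anywhere; `countPackages` and `isNUMA` are hypothetical names, not functions from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// countPackages returns the number of distinct package IDs found in the
// given file contents (one entry per CPU, as read from
// .../topology/physical_package_id).
func countPackages(contents []string) int {
	ids := map[string]struct{}{}
	for _, c := range contents {
		// The sysfs files end with a newline, so trim before comparing.
		ids[strings.TrimSpace(c)] = struct{}{}
	}
	return len(ids)
}

// isNUMA treats more than one physical package as a multi-socket system.
func isNUMA(contents []string) bool {
	return countPackages(contents) > 1
}

func main() {
	fmt.Println(isNUMA([]string{"0\n", "0\n", "1\n"}))
	fmt.Println(isNUMA([]string{"0\n", "0\n"}))
}
```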


@@ -2062,18 +2062,9 @@ power management:
for k, v := range testCases {
t.Run(k, func(t *testing.T) {
buf := bytes.NewBufferString(v.input)
cpus, err := linuxCPUDetails(buf)
if err != nil {
t.Fatal(err)
}
cpus := linuxCPUDetails(buf)
slog.Info("example", "scenario", k, "cpus", cpus)
si := SystemInfo{
System: CPUInfo{
CPUs: cpus,
},
}
threadCount := si.GetOptimalThreadCount()
if len(v.expCPUs) != len(cpus) {
t.Fatalf("incorrect number of sockets: expected:%v got:%v", v.expCPUs, cpus)
}
@@ -2088,10 +2079,6 @@ power management:
t.Fatalf("incorrect number of threads: expected:%v got:%v", v.expCPUs[i], c)
}
}
if threadCount != v.expThreadCount {
t.Fatalf("incorrect thread count expected:%d got:%d", v.expThreadCount, threadCount)
}
})
}
}


@@ -26,29 +26,6 @@ var (
GetLogicalProcessorInformationEx = k32.NewProc("GetLogicalProcessorInformationEx")
)
var CudartGlobs = []string{
"c:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v*\\bin\\cudart64_*.dll",
}
var NvmlGlobs = []string{
"c:\\Windows\\System32\\nvml.dll",
}
var NvcudaGlobs = []string{
"c:\\windows\\system*\\nvcuda.dll",
}
var OneapiGlobs = []string{
"c:\\Windows\\System32\\DriverStore\\FileRepository\\*\\ze_intel_gpu64.dll",
}
var (
CudartMgmtName = "cudart64_*.dll"
NvcudaMgmtName = "nvcuda.dll"
NvmlMgmtName = "nvml.dll"
OneapiMgmtName = "ze_intel_gpu64.dll"
)
func GetCPUMem() (memInfo, error) {
memStatus := MEMORYSTATUSEX{length: sizeofMemoryStatusEx}
r1, _, err := globalMemoryStatusExProc.Call(uintptr(unsafe.Pointer(&memStatus)))
@@ -122,27 +99,22 @@ func (pkg *winPackage) IsMember(target *GROUP_AFFINITY) bool {
}
func getLogicalProcessorInformationEx() ([]byte, error) {
buf := make([]byte, 1)
buf := make([]byte, 1024)
bufSize := len(buf)
ret, _, err := GetLogicalProcessorInformationEx.Call(
uintptr(RelationAll),
uintptr(unsafe.Pointer(&buf[0])),
uintptr(unsafe.Pointer(&bufSize)),
)
if ret != 0 {
return nil, fmt.Errorf("failed to determine size info ret:%d %w", ret, err)
var err error
for range 3 {
var ret uintptr
ret, _, err = GetLogicalProcessorInformationEx.Call(
uintptr(RelationAll),
uintptr(unsafe.Pointer(&buf[0])),
uintptr(unsafe.Pointer(&bufSize)),
)
if ret == 1 && bufSize <= len(buf) {
return buf, nil
}
buf = make([]byte, bufSize)
}
buf = make([]byte, bufSize)
ret, _, err = GetLogicalProcessorInformationEx.Call(
uintptr(RelationAll),
uintptr(unsafe.Pointer(&buf[0])),
uintptr(unsafe.Pointer(&bufSize)),
)
if ret == 0 {
return nil, fmt.Errorf("failed to gather processor information ret:%d buflen:%d %w", ret, bufSize, err)
}
return buf, nil
return nil, fmt.Errorf("unable to determine CPU details: %w", err)
}
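The new getLogicalProcessorInformationEx replaces the two-call "probe, then fetch" sequence with a bounded grow-and-retry loop. The pattern can be sketched without the Windows API. Here `queryFunc` and `fixedSizeQuery` are hypothetical stand-ins for the system call, and, unlike the diff, this sketch trims the returned slice to the reported size.

```go
package main

import (
	"errors"
	"fmt"
)

// queryFunc is a hypothetical stand-in for the system call: it writes the
// required size into *size and reports whether buf was large enough.
type queryFunc func(buf []byte, size *int) bool

// fetchWithRetry starts with an initial buffer and, when the call reports a
// larger requirement, reallocates and tries again up to attempts times.
func fetchWithRetry(query queryFunc, initial, attempts int) ([]byte, error) {
	buf := make([]byte, initial)
	for i := 0; i < attempts; i++ {
		size := len(buf)
		if query(buf, &size) && size <= len(buf) {
			return buf[:size], nil
		}
		// Either the buffer was too small or the requirement changed
		// between calls; resize to what was asked for and retry.
		buf = make([]byte, size)
	}
	return nil, errors.New("buffer size did not settle within the retry limit")
}

// fixedSizeQuery simulates a call that always needs exactly need bytes.
func fixedSizeQuery(need int) queryFunc {
	return func(buf []byte, size *int) bool {
		*size = need
		if len(buf) < need {
			return false
		}
		for i := 0; i < need; i++ {
			buf[i] = 0xAB
		}
		return true
	}
}

func main() {
	buf, err := fetchWithRetry(fixedSizeQuery(4096), 1024, 3)
	fmt.Println(len(buf), err)
}
```

Bounding the loop matters because the required size can change between calls, so a single resize is not guaranteed to be enough.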
func processSystemLogicalProcessorInforationList(buf []byte) []*winPackage {
@@ -217,10 +189,11 @@ func processSystemLogicalProcessorInforationList(buf []byte) []*winPackage {
return packages
}
func GetCPUDetails() ([]CPU, error) {
func GetCPUDetails() []CPU {
buf, err := getLogicalProcessorInformationEx()
if err != nil {
return nil, err
slog.Warn("failed to get CPU details", "error", err)
return nil
}
packages := processSystemLogicalProcessorInforationList(buf)
cpus := make([]CPU, len(packages))
@@ -230,5 +203,10 @@ func GetCPUDetails() ([]CPU, error) {
cpus[i].EfficiencyCoreCount = pkg.efficiencyCoreCount
cpus[i].ThreadCount = pkg.threadCount
}
return cpus, nil
return cpus
}
func IsNUMA() bool {
// numa support in ggml is linux only
return false
}


@@ -1,69 +0,0 @@
//go:build linux || windows
package discover
import (
"fmt"
"log/slog"
"os"
"regexp"
"runtime"
"strconv"
"strings"
)
// Jetson devices have JETSON_JETPACK="x.y.z" factory set to the Jetpack version installed.
// Included to drive logic for reducing Ollama-allocated overhead on L4T/Jetson devices.
var CudaTegra string = os.Getenv("JETSON_JETPACK")
func cudaGetVisibleDevicesEnv(gpuInfo []GpuInfo) (string, string) {
ids := []string{}
for _, info := range gpuInfo {
if info.Library != "cuda" {
// TODO shouldn't happen if things are wired correctly...
slog.Debug("cudaGetVisibleDevicesEnv skipping over non-cuda device", "library", info.Library)
continue
}
ids = append(ids, info.ID)
}
return "CUDA_VISIBLE_DEVICES", strings.Join(ids, ",")
}
func cudaVariant(gpuInfo CudaGPUInfo) string {
if runtime.GOARCH == "arm64" && runtime.GOOS == "linux" {
if CudaTegra != "" {
ver := strings.Split(CudaTegra, ".")
if len(ver) > 0 {
return "jetpack" + ver[0]
}
} else if data, err := os.ReadFile("/etc/nv_tegra_release"); err == nil {
r := regexp.MustCompile(` R(\d+) `)
m := r.FindSubmatch(data)
if len(m) != 2 {
slog.Info("Unexpected format for /etc/nv_tegra_release. Set JETSON_JETPACK to select version")
} else {
if l4t, err := strconv.Atoi(string(m[1])); err == nil {
// Note: mapping from L4T -> JetPack is inconsistent (can't just subtract 30)
// https://developer.nvidia.com/embedded/jetpack-archive
switch l4t {
case 35:
return "jetpack5"
case 36:
return "jetpack6"
default:
slog.Info("unsupported L4T version", "nv_tegra_release", string(data))
}
}
}
}
return "sbsa"
}
// driver 12.0 has problems with the cuda v12 library, so run v11 on those older drivers
if gpuInfo.DriverMajor < 12 || (gpuInfo.DriverMajor == 12 && gpuInfo.DriverMinor == 0) {
// The detected driver is older than Feb 2023
slog.Warn("old CUDA driver detected - please upgrade to a newer driver", "version", fmt.Sprintf("%d.%d", gpuInfo.DriverMajor, gpuInfo.DriverMinor))
return "v11"
}
return "v12"
}
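The L4T-to-JetPack lookup in cudaVariant can be exercised on its own. The sketch below reuses the regular expression and the two mappings shown above; `jetpackFromRelease` is a hypothetical helper, and the sample release strings are illustrative rather than copied from a real device.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// l4tRelease matches the " R<major> " token in /etc/nv_tegra_release.
var l4tRelease = regexp.MustCompile(` R(\d+) `)

// jetpackFromRelease maps the L4T major version found in the release file
// contents to a JetPack variant name, or returns "" when it is unknown.
func jetpackFromRelease(data string) string {
	m := l4tRelease.FindStringSubmatch(data)
	if len(m) != 2 {
		return ""
	}
	major, err := strconv.Atoi(m[1])
	if err != nil {
		return ""
	}
	// The mapping is a lookup table because L4T and JetPack versions do not
	// differ by a fixed offset.
	switch major {
	case 35:
		return "jetpack5"
	case 36:
		return "jetpack6"
	}
	return ""
}

func main() {
	fmt.Println(jetpackFromRelease("# R36 (release), REVISION: 3.0"))
	fmt.Println(jetpackFromRelease("# R35 (release), REVISION: 4.1"))
}
```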


@@ -1,718 +1,73 @@
//go:build linux || windows
package discover
/*
#cgo linux LDFLAGS: -lrt -lpthread -ldl -lstdc++ -lm
#cgo windows LDFLAGS: -lpthread
#include "gpu_info.h"
*/
import "C"
import (
"fmt"
"log/slog"
"os"
"path/filepath"
"regexp"
"runtime"
"strconv"
"strings"
"sync"
"unsafe"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/ml"
)
type cudaHandles struct {
deviceCount int
cudart *C.cudart_handle_t
nvcuda *C.nvcuda_handle_t
nvml *C.nvml_handle_t
// Jetson devices have JETSON_JETPACK="x.y.z" factory set to the Jetpack version installed.
// Included to drive logic for reducing Ollama-allocated overhead on L4T/Jetson devices.
var CudaTegra string = os.Getenv("JETSON_JETPACK")
// GetSystemInfo returns the last cached state of the GPUs on the system
func GetSystemInfo() ml.SystemInfo {
memInfo, err := GetCPUMem()
if err != nil {
slog.Warn("error looking up system memory", "error", err)
}
var threadCount int
cpus := GetCPUDetails()
for _, c := range cpus {
threadCount += c.CoreCount - c.EfficiencyCoreCount
}
if threadCount == 0 {
// Fall back to Go's num CPU
threadCount = runtime.NumCPU()
}
return ml.SystemInfo{
ThreadCount: threadCount,
TotalMemory: memInfo.TotalMemory,
FreeMemory: memInfo.FreeMemory,
FreeSwap: memInfo.FreeSwap,
}
}
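The thread count in the new GetSystemInfo is the number of performance cores summed across sockets, with a fallback when no CPU details are available. A minimal sketch of that arithmetic, using a hypothetical `cpu` type in place of the package's `CPU` struct and passing the fallback in explicitly:

```go
package main

import (
	"fmt"
	"runtime"
)

// cpu is a hypothetical stand-in carrying only the fields the sum needs.
type cpu struct {
	CoreCount           int
	EfficiencyCoreCount int
}

// performanceThreads sums performance cores (total minus efficiency cores)
// over all sockets and returns fallback when the sum is zero.
func performanceThreads(cpus []cpu, fallback int) int {
	n := 0
	for _, c := range cpus {
		n += c.CoreCount - c.EfficiencyCoreCount
	}
	if n == 0 {
		return fallback
	}
	return n
}

func main() {
	hybrid := []cpu{{CoreCount: 16, EfficiencyCoreCount: 8}}
	fmt.Println(performanceThreads(hybrid, runtime.NumCPU()))
}
```

Because GetCPUDetails now returns nil instead of an error, an empty slice leads directly to the fallback.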
type oneapiHandles struct {
oneapi *C.oneapi_handle_t
deviceCount int
}
const (
cudaMinimumMemory = 457 * format.MebiByte
rocmMinimumMemory = 457 * format.MebiByte
// TODO OneAPI minimum memory
)
var (
gpuMutex sync.Mutex
bootstrapped bool
cpus []CPUInfo
cudaGPUs []CudaGPUInfo
nvcudaLibPath string
cudartLibPath string
oneapiLibPath string
nvmlLibPath string
rocmGPUs []RocmGPUInfo
oneapiGPUs []OneapiGPUInfo
// If any discovered GPUs are incompatible, report why
unsupportedGPUs []UnsupportedGPUInfo
// Keep track of errors during bootstrapping so that if GPUs that were
// expected to be present are missing, this may explain why
bootstrapErrors []error
)
// With our current CUDA compile flags, older than 5.0 will not work properly
// (string values used to allow ldflags overrides at build time)
var (
CudaComputeMajorMin = "5"
CudaComputeMinorMin = "0"
)
var RocmComputeMajorMin = "9"
// TODO find a better way to detect iGPU instead of minimum memory
const IGPUMemLimit = 1 * format.GibiByte // 512M is what they typically report, so anything less than 1G must be iGPU
// Note: gpuMutex must already be held
func initCudaHandles() *cudaHandles {
// TODO - if the ollama build is CPU only, don't do these checks as they're irrelevant and confusing
cHandles := &cudaHandles{}
// Short Circuit if we already know which library to use
// ignore bootstrap errors in this case since we already recorded them
if nvmlLibPath != "" {
cHandles.nvml, _, _ = loadNVMLMgmt([]string{nvmlLibPath})
return cHandles
}
if nvcudaLibPath != "" {
cHandles.deviceCount, cHandles.nvcuda, _, _ = loadNVCUDAMgmt([]string{nvcudaLibPath})
return cHandles
}
if cudartLibPath != "" {
cHandles.deviceCount, cHandles.cudart, _, _ = loadCUDARTMgmt([]string{cudartLibPath})
return cHandles
}
slog.Debug("searching for GPU discovery libraries for NVIDIA")
var cudartMgmtPatterns []string
// Aligned with driver, we can't carry as payloads
nvcudaMgmtPatterns := NvcudaGlobs
cudartMgmtPatterns = append(cudartMgmtPatterns, filepath.Join(LibOllamaPath, "cuda_v*", CudartMgmtName))
cudartMgmtPatterns = append(cudartMgmtPatterns, CudartGlobs...)
if len(NvmlGlobs) > 0 {
nvmlLibPaths := FindGPULibs(NvmlMgmtName, NvmlGlobs)
if len(nvmlLibPaths) > 0 {
nvml, libPath, err := loadNVMLMgmt(nvmlLibPaths)
if nvml != nil {
slog.Debug("nvidia-ml loaded", "library", libPath)
cHandles.nvml = nvml
nvmlLibPath = libPath
func cudaJetpack() string {
if runtime.GOARCH == "arm64" && runtime.GOOS == "linux" {
if CudaTegra != "" {
ver := strings.Split(CudaTegra, ".")
if len(ver) > 0 {
return "jetpack" + ver[0]
}
if err != nil {
bootstrapErrors = append(bootstrapErrors, err)
}
}
}
nvcudaLibPaths := FindGPULibs(NvcudaMgmtName, nvcudaMgmtPatterns)
if len(nvcudaLibPaths) > 0 {
deviceCount, nvcuda, libPath, err := loadNVCUDAMgmt(nvcudaLibPaths)
if nvcuda != nil {
slog.Debug("detected GPUs", "count", deviceCount, "library", libPath)
cHandles.nvcuda = nvcuda
cHandles.deviceCount = deviceCount
nvcudaLibPath = libPath
return cHandles
}
if err != nil {
bootstrapErrors = append(bootstrapErrors, err)
}
}
cudartLibPaths := FindGPULibs(CudartMgmtName, cudartMgmtPatterns)
if len(cudartLibPaths) > 0 {
deviceCount, cudart, libPath, err := loadCUDARTMgmt(cudartLibPaths)
if cudart != nil {
slog.Debug("detected GPUs", "library", libPath, "count", deviceCount)
cHandles.cudart = cudart
cHandles.deviceCount = deviceCount
cudartLibPath = libPath
return cHandles
}
if err != nil {
bootstrapErrors = append(bootstrapErrors, err)
}
}
return cHandles
}
// Note: gpuMutex must already be held
func initOneAPIHandles() *oneapiHandles {
oHandles := &oneapiHandles{}
// Short Circuit if we already know which library to use
// ignore bootstrap errors in this case since we already recorded them
if oneapiLibPath != "" {
oHandles.deviceCount, oHandles.oneapi, _, _ = loadOneapiMgmt([]string{oneapiLibPath})
return oHandles
}
oneapiLibPaths := FindGPULibs(OneapiMgmtName, OneapiGlobs)
if len(oneapiLibPaths) > 0 {
var err error
oHandles.deviceCount, oHandles.oneapi, oneapiLibPath, err = loadOneapiMgmt(oneapiLibPaths)
if err != nil {
bootstrapErrors = append(bootstrapErrors, err)
}
}
return oHandles
}
func GetCPUInfo() GpuInfoList {
gpuMutex.Lock()
if !bootstrapped {
gpuMutex.Unlock()
GetGPUInfo()
} else {
gpuMutex.Unlock()
}
return GpuInfoList{cpus[0].GpuInfo}
}
func GetGPUInfo() GpuInfoList {
// TODO - consider exploring lspci (and equivalent on windows) to check for
// GPUs so we can report warnings if we see Nvidia/AMD but fail to load the libraries
gpuMutex.Lock()
defer gpuMutex.Unlock()
needRefresh := true
var cHandles *cudaHandles
var oHandles *oneapiHandles
defer func() {
if cHandles != nil {
if cHandles.cudart != nil {
C.cudart_release(*cHandles.cudart)
}
if cHandles.nvcuda != nil {
C.nvcuda_release(*cHandles.nvcuda)
}
if cHandles.nvml != nil {
C.nvml_release(*cHandles.nvml)
}
}
if oHandles != nil {
if oHandles.oneapi != nil {
// TODO - is this needed?
C.oneapi_release(*oHandles.oneapi)
}
}
}()
if !bootstrapped {
slog.Info("looking for compatible GPUs")
cudaComputeMajorMin, err := strconv.Atoi(CudaComputeMajorMin)
if err != nil {
slog.Error("invalid CudaComputeMajorMin setting", "value", CudaComputeMajorMin, "error", err)
}
cudaComputeMinorMin, err := strconv.Atoi(CudaComputeMinorMin)
if err != nil {
slog.Error("invalid CudaComputeMinorMin setting", "value", CudaComputeMinorMin, "error", err)
}
bootstrapErrors = []error{}
needRefresh = false
var memInfo C.mem_info_t
mem, err := GetCPUMem()
if err != nil {
slog.Warn("error looking up system memory", "error", err)
}
details, err := GetCPUDetails()
if err != nil {
slog.Warn("failed to lookup CPU details", "error", err)
}
cpus = []CPUInfo{
{
GpuInfo: GpuInfo{
memInfo: mem,
Library: "cpu",
ID: "0",
},
CPUs: details,
},
}
// Load ALL libraries
cHandles = initCudaHandles()
// NVIDIA
for i := range cHandles.deviceCount {
if cHandles.cudart != nil || cHandles.nvcuda != nil {
gpuInfo := CudaGPUInfo{
GpuInfo: GpuInfo{
Library: "cuda",
},
index: i,
}
var driverMajor int
var driverMinor int
if cHandles.cudart != nil {
C.cudart_bootstrap(*cHandles.cudart, C.int(i), &memInfo)
} else {
C.nvcuda_bootstrap(*cHandles.nvcuda, C.int(i), &memInfo)
driverMajor = int(cHandles.nvcuda.driver_major)
driverMinor = int(cHandles.nvcuda.driver_minor)
}
if memInfo.err != nil {
slog.Info("error looking up nvidia GPU memory", "error", C.GoString(memInfo.err))
C.free(unsafe.Pointer(memInfo.err))
continue
}
gpuInfo.TotalMemory = uint64(memInfo.total)
gpuInfo.FreeMemory = uint64(memInfo.free)
gpuInfo.ID = C.GoString(&memInfo.gpu_id[0])
gpuInfo.Compute = fmt.Sprintf("%d.%d", memInfo.major, memInfo.minor)
gpuInfo.computeMajor = int(memInfo.major)
gpuInfo.computeMinor = int(memInfo.minor)
gpuInfo.MinimumMemory = cudaMinimumMemory
gpuInfo.DriverMajor = driverMajor
gpuInfo.DriverMinor = driverMinor
variant := cudaVariant(gpuInfo)
// Start with our bundled libraries
if variant != "" {
variantPath := filepath.Join(LibOllamaPath, "cuda_"+variant)
if _, err := os.Stat(variantPath); err == nil {
// Put the variant directory first in the search path to avoid runtime linking to the wrong library
gpuInfo.DependencyPath = append([]string{variantPath}, gpuInfo.DependencyPath...)
}
}
gpuInfo.Name = C.GoString(&memInfo.gpu_name[0])
gpuInfo.Variant = variant
if int(memInfo.major) < cudaComputeMajorMin || (int(memInfo.major) == cudaComputeMajorMin && int(memInfo.minor) < cudaComputeMinorMin) {
unsupportedGPUs = append(unsupportedGPUs,
UnsupportedGPUInfo{
GpuInfo: gpuInfo.GpuInfo,
})
slog.Info(fmt.Sprintf("[%d] CUDA GPU is too old. Compute Capability detected: %d.%d", i, memInfo.major, memInfo.minor))
continue
}
// query the management library as well so we can record any skew between the two
// which represents overhead on the GPU we must set aside on subsequent updates
if cHandles.nvml != nil {
uuid := C.CString(gpuInfo.ID)
defer C.free(unsafe.Pointer(uuid))
C.nvml_get_free(*cHandles.nvml, uuid, &memInfo.free, &memInfo.total, &memInfo.used)
if memInfo.err != nil {
slog.Warn("error looking up nvidia GPU memory", "error", C.GoString(memInfo.err))
C.free(unsafe.Pointer(memInfo.err))
} else {
if memInfo.free != 0 && uint64(memInfo.free) > gpuInfo.FreeMemory {
gpuInfo.OSOverhead = uint64(memInfo.free) - gpuInfo.FreeMemory
slog.Info("detected OS VRAM overhead",
"id", gpuInfo.ID,
"library", gpuInfo.Library,
"compute", gpuInfo.Compute,
"driver", fmt.Sprintf("%d.%d", gpuInfo.DriverMajor, gpuInfo.DriverMinor),
"name", gpuInfo.Name,
"overhead", format.HumanBytes2(gpuInfo.OSOverhead),
)
}
}
}
// TODO potentially sort on our own algorithm instead of what the underlying GPU library does...
cudaGPUs = append(cudaGPUs, gpuInfo)
}
}
// Intel
if envconfig.IntelGPU() {
oHandles = initOneAPIHandles()
if oHandles != nil && oHandles.oneapi != nil {
for d := range oHandles.oneapi.num_drivers {
if oHandles.oneapi == nil {
// shouldn't happen
slog.Warn("nil oneapi handle with driver count", "count", int(oHandles.oneapi.num_drivers))
continue
}
devCount := C.oneapi_get_device_count(*oHandles.oneapi, C.int(d))
for i := range devCount {
gpuInfo := OneapiGPUInfo{
GpuInfo: GpuInfo{
Library: "oneapi",
},
driverIndex: int(d),
gpuIndex: int(i),
}
// TODO - split bootstrapping from updating free memory
C.oneapi_check_vram(*oHandles.oneapi, C.int(d), i, &memInfo)
// TODO - convert this to MinimumMemory based on testing...
var totalFreeMem float64 = float64(memInfo.free) * 0.95 // work-around: leave some reserve vram for mkl lib used in ggml-sycl backend.
memInfo.free = C.uint64_t(totalFreeMem)
gpuInfo.TotalMemory = uint64(memInfo.total)
gpuInfo.FreeMemory = uint64(memInfo.free)
gpuInfo.ID = C.GoString(&memInfo.gpu_id[0])
gpuInfo.Name = C.GoString(&memInfo.gpu_name[0])
gpuInfo.DependencyPath = []string{LibOllamaPath}
oneapiGPUs = append(oneapiGPUs, gpuInfo)
}
}
}
}
rocmGPUs, err = AMDGetGPUInfo()
if err != nil {
bootstrapErrors = append(bootstrapErrors, err)
}
bootstrapped = true
if len(cudaGPUs) == 0 && len(rocmGPUs) == 0 && len(oneapiGPUs) == 0 {
slog.Info("no compatible GPUs were discovered")
}
// TODO verify we have runners for the discovered GPUs, filter out any that aren't supported with good error messages
}
// For detected GPUs, load library if not loaded
// Refresh free memory usage
if needRefresh {
mem, err := GetCPUMem()
if err != nil {
slog.Warn("error looking up system memory", "error", err)
} else {
slog.Debug("updating system memory data",
slog.Group(
"before",
"total", format.HumanBytes2(cpus[0].TotalMemory),
"free", format.HumanBytes2(cpus[0].FreeMemory),
"free_swap", format.HumanBytes2(cpus[0].FreeSwap),
),
slog.Group(
"now",
"total", format.HumanBytes2(mem.TotalMemory),
"free", format.HumanBytes2(mem.FreeMemory),
"free_swap", format.HumanBytes2(mem.FreeSwap),
),
)
cpus[0].FreeMemory = mem.FreeMemory
cpus[0].FreeSwap = mem.FreeSwap
}
var memInfo C.mem_info_t
if cHandles == nil && len(cudaGPUs) > 0 {
cHandles = initCudaHandles()
}
for i, gpu := range cudaGPUs {
if cHandles.nvml != nil {
uuid := C.CString(gpu.ID)
defer C.free(unsafe.Pointer(uuid))
C.nvml_get_free(*cHandles.nvml, uuid, &memInfo.free, &memInfo.total, &memInfo.used)
} else if cHandles.cudart != nil {
C.cudart_bootstrap(*cHandles.cudart, C.int(gpu.index), &memInfo)
} else if cHandles.nvcuda != nil {
C.nvcuda_get_free(*cHandles.nvcuda, C.int(gpu.index), &memInfo.free, &memInfo.total)
memInfo.used = memInfo.total - memInfo.free
} else if data, err := os.ReadFile("/etc/nv_tegra_release"); err == nil {
r := regexp.MustCompile(` R(\d+) `)
m := r.FindSubmatch(data)
if len(m) != 2 {
slog.Info("Unexpected format for /etc/nv_tegra_release. Set JETSON_JETPACK to select version")
} else {
// shouldn't happen
slog.Warn("no valid cuda library loaded to refresh vram usage")
break
}
if memInfo.err != nil {
slog.Warn("error looking up nvidia GPU memory", "error", C.GoString(memInfo.err))
C.free(unsafe.Pointer(memInfo.err))
continue
}
if memInfo.free == 0 {
slog.Warn("error looking up nvidia GPU memory")
continue
}
if cHandles.nvml != nil && gpu.OSOverhead > 0 {
// When using the management library update based on recorded overhead
memInfo.free -= C.uint64_t(gpu.OSOverhead)
}
slog.Debug("updating cuda memory data",
"gpu", gpu.ID,
"name", gpu.Name,
"overhead", format.HumanBytes2(gpu.OSOverhead),
slog.Group(
"before",
"total", format.HumanBytes2(gpu.TotalMemory),
"free", format.HumanBytes2(gpu.FreeMemory),
),
slog.Group(
"now",
"total", format.HumanBytes2(uint64(memInfo.total)),
"free", format.HumanBytes2(uint64(memInfo.free)),
"used", format.HumanBytes2(uint64(memInfo.used)),
),
)
cudaGPUs[i].FreeMemory = uint64(memInfo.free)
}
if oHandles == nil && len(oneapiGPUs) > 0 {
oHandles = initOneAPIHandles()
}
for i, gpu := range oneapiGPUs {
if oHandles.oneapi == nil {
// shouldn't happen
slog.Warn("nil oneapi handle with device count", "count", oHandles.deviceCount)
continue
}
C.oneapi_check_vram(*oHandles.oneapi, C.int(gpu.driverIndex), C.int(gpu.gpuIndex), &memInfo)
// TODO - convert this to MinimumMemory based on testing...
var totalFreeMem float64 = float64(memInfo.free) * 0.95 // work-around: leave some reserve vram for mkl lib used in ggml-sycl backend.
memInfo.free = C.uint64_t(totalFreeMem)
oneapiGPUs[i].FreeMemory = uint64(memInfo.free)
}
err = RocmGPUInfoList(rocmGPUs).RefreshFreeMemory()
if err != nil {
slog.Debug("problem refreshing ROCm free memory", "error", err)
}
}
resp := []GpuInfo{}
for _, gpu := range cudaGPUs {
resp = append(resp, gpu.GpuInfo)
}
for _, gpu := range rocmGPUs {
resp = append(resp, gpu.GpuInfo)
}
for _, gpu := range oneapiGPUs {
resp = append(resp, gpu.GpuInfo)
}
if len(resp) == 0 {
resp = append(resp, cpus[0].GpuInfo)
}
return resp
}
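The OS overhead bookkeeping in the removed GetGPUInfo has two halves: at bootstrap, record how much more free VRAM the management library reports than the runtime library does; on refresh, subtract that amount from the management library's figure. A sketch of both steps, with hypothetical function names and an underflow guard that is an assumption of this sketch, not part of the original:

```go
package main

import "fmt"

// osOverhead returns the skew between the management library's free-memory
// figure and the runtime library's figure, or 0 when there is none.
func osOverhead(mgmtFree, runtimeFree uint64) uint64 {
	if mgmtFree != 0 && mgmtFree > runtimeFree {
		return mgmtFree - runtimeFree
	}
	return 0
}

// adjustedFree applies a previously recorded overhead to a fresh reading.
// The clamp to zero is an added safety check (an assumption of this sketch).
func adjustedFree(mgmtFree, overhead uint64) uint64 {
	if overhead > mgmtFree {
		return 0
	}
	return mgmtFree - overhead
}

func main() {
	const mib = 1 << 20
	overhead := osOverhead(8000*mib, 7600*mib)
	fmt.Println(overhead/mib, adjustedFree(6000*mib, overhead)/mib)
}
```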
func FindGPULibs(baseLibName string, defaultPatterns []string) []string {
// Multiple GPU libraries may exist, and some may not work, so keep trying until we exhaust them
gpuLibPaths := []string{}
slog.Debug("Searching for GPU library", "name", baseLibName)
// search our bundled libraries first
patterns := []string{filepath.Join(LibOllamaPath, baseLibName)}
var ldPaths []string
switch runtime.GOOS {
case "windows":
ldPaths = strings.Split(os.Getenv("PATH"), string(os.PathListSeparator))
case "linux":
ldPaths = strings.Split(os.Getenv("LD_LIBRARY_PATH"), string(os.PathListSeparator))
}
// then search the system's LD_LIBRARY_PATH
for _, p := range ldPaths {
p, err := filepath.Abs(p)
if err != nil {
continue
}
patterns = append(patterns, filepath.Join(p, baseLibName))
}
// finally, search the default patterns provided by the caller
patterns = append(patterns, defaultPatterns...)
slog.Debug("gpu library search", "globs", patterns)
for _, pattern := range patterns {
// Nvidia PhysX known to return bogus results
if strings.Contains(pattern, "PhysX") {
slog.Debug("skipping PhysX cuda library path", "path", pattern)
continue
}
// Ignore glob discovery errors
matches, _ := filepath.Glob(pattern)
for _, match := range matches {
// Resolve any links so we don't try the same lib multiple times
// and weed out any dups across globs
libPath := match
tmp := match
var err error
for ; err == nil; tmp, err = os.Readlink(libPath) {
if !filepath.IsAbs(tmp) {
tmp = filepath.Join(filepath.Dir(libPath), tmp)
}
libPath = tmp
}
new := true
for _, cmp := range gpuLibPaths {
if cmp == libPath {
new = false
break
if l4t, err := strconv.Atoi(string(m[1])); err == nil {
// Note: mapping from L4T -> JetPack is inconsistent (can't just subtract 30)
// https://developer.nvidia.com/embedded/jetpack-archive
switch l4t {
case 35:
return "jetpack5"
case 36:
return "jetpack6"
default:
// Newer Jetson systems use the SBSA runtime
slog.Debug("unrecognized L4T version", "nv_tegra_release", string(data))
}
}
}
if new {
gpuLibPaths = append(gpuLibPaths, libPath)
}
}
}
slog.Debug("discovered GPU libraries", "paths", gpuLibPaths)
return gpuLibPaths
}
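FindGPULibs searches in a fixed order: bundled libraries first, then the directories from the library search path, then the caller's default globs, skipping anything under a PhysX path. The ordering step can be sketched as below. `searchPatterns` is a hypothetical helper that takes the already-split directory list, skips empty entries (an assumption of this sketch), and leaves out the globbing and symlink resolution.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// searchPatterns builds the ordered list of glob patterns: the bundled
// directory, then each search-path directory, then the caller's defaults.
func searchPatterns(bundledDir, baseLibName string, pathDirs, defaults []string) []string {
	candidates := []string{filepath.Join(bundledDir, baseLibName)}
	for _, d := range pathDirs {
		if d == "" {
			continue
		}
		candidates = append(candidates, filepath.Join(d, baseLibName))
	}
	candidates = append(candidates, defaults...)

	patterns := []string{}
	for _, p := range candidates {
		// PhysX ships a CUDA library known to return bogus results.
		if strings.Contains(p, "PhysX") {
			continue
		}
		patterns = append(patterns, p)
	}
	return patterns
}

func main() {
	patterns := searchPatterns("/opt/app/lib", "libcuda.so*",
		[]string{"/usr/lib64"}, []string{"/usr/lib*/libcuda.so*"})
	for _, p := range patterns {
		fmt.Println(p)
	}
}
```

Keeping the bundled directory first means a library shipped with the application is tried before any system copy.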
// Bootstrap the runtime library
// Returns: num devices, handle, libPath, error
func loadCUDARTMgmt(cudartLibPaths []string) (int, *C.cudart_handle_t, string, error) {
var resp C.cudart_init_resp_t
resp.ch.verbose = getVerboseState()
var err error
for _, libPath := range cudartLibPaths {
lib := C.CString(libPath)
defer C.free(unsafe.Pointer(lib))
C.cudart_init(lib, &resp)
if resp.err != nil {
err = fmt.Errorf("Unable to load cudart library %s: %s", libPath, C.GoString(resp.err))
slog.Debug(err.Error())
C.free(unsafe.Pointer(resp.err))
} else {
err = nil
return int(resp.num_devices), &resp.ch, libPath, err
}
}
return 0, nil, "", err
}
// Bootstrap the driver library
// Returns: num devices, handle, libPath, error
func loadNVCUDAMgmt(nvcudaLibPaths []string) (int, *C.nvcuda_handle_t, string, error) {
var resp C.nvcuda_init_resp_t
resp.ch.verbose = getVerboseState()
var err error
for _, libPath := range nvcudaLibPaths {
lib := C.CString(libPath)
defer C.free(unsafe.Pointer(lib))
C.nvcuda_init(lib, &resp)
if resp.err != nil {
// Decide what log level based on the type of error message to help users understand why
switch resp.cudaErr {
case C.CUDA_ERROR_INSUFFICIENT_DRIVER, C.CUDA_ERROR_SYSTEM_DRIVER_MISMATCH:
err = fmt.Errorf("version mismatch between driver and cuda driver library - reboot or upgrade may be required: library %s", libPath)
slog.Warn(err.Error())
case C.CUDA_ERROR_NO_DEVICE:
err = fmt.Errorf("no nvidia devices detected by library %s", libPath)
slog.Info(err.Error())
case C.CUDA_ERROR_UNKNOWN:
err = fmt.Errorf("unknown error initializing cuda driver library %s: %s. see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md for more information", libPath, C.GoString(resp.err))
slog.Warn(err.Error())
default:
msg := C.GoString(resp.err)
if strings.Contains(msg, "wrong ELF class") {
slog.Debug("skipping 32bit library", "library", libPath)
} else {
err = fmt.Errorf("Unable to load nvcuda library %s: %s", libPath, C.GoString(resp.err))
slog.Info(err.Error())
}
}
C.free(unsafe.Pointer(resp.err))
} else {
err = nil
return int(resp.num_devices), &resp.ch, libPath, err
}
}
return 0, nil, "", err
}
// Bootstrap the management library
// Returns: handle, libPath, error
func loadNVMLMgmt(nvmlLibPaths []string) (*C.nvml_handle_t, string, error) {
var resp C.nvml_init_resp_t
resp.ch.verbose = getVerboseState()
var err error
for _, libPath := range nvmlLibPaths {
lib := C.CString(libPath)
defer C.free(unsafe.Pointer(lib))
C.nvml_init(lib, &resp)
if resp.err != nil {
err = fmt.Errorf("Unable to load NVML management library %s: %s", libPath, C.GoString(resp.err))
slog.Info(err.Error())
C.free(unsafe.Pointer(resp.err))
} else {
err = nil
return &resp.ch, libPath, err
}
}
return nil, "", err
}
// bootstrap the Intel GPU library
// Returns: num devices, handle, libPath, error
func loadOneapiMgmt(oneapiLibPaths []string) (int, *C.oneapi_handle_t, string, error) {
var resp C.oneapi_init_resp_t
num_devices := 0
resp.oh.verbose = getVerboseState()
var err error
for _, libPath := range oneapiLibPaths {
lib := C.CString(libPath)
defer C.free(unsafe.Pointer(lib))
C.oneapi_init(lib, &resp)
if resp.err != nil {
err = fmt.Errorf("Unable to load oneAPI management library %s: %s", libPath, C.GoString(resp.err))
slog.Debug(err.Error())
C.free(unsafe.Pointer(resp.err))
} else {
err = nil
for i := range resp.oh.num_drivers {
num_devices += int(C.oneapi_get_device_count(resp.oh, C.int(i)))
}
return num_devices, &resp.oh, libPath, err
}
}
return 0, nil, "", err
}
func getVerboseState() C.uint16_t {
if envconfig.LogLevel() < slog.LevelInfo {
return C.uint16_t(1)
}
return C.uint16_t(0)
}
// Given the list of GPUs this instantiation is targeted for,
// figure out the visible devices environment variable
//
// If different libraries are detected, the first one is what we use
func (l GpuInfoList) GetVisibleDevicesEnv() (string, string) {
if len(l) == 0 {
return "", ""
}
switch l[0].Library {
case "cuda":
return cudaGetVisibleDevicesEnv(l)
case "rocm":
return rocmGetVisibleDevicesEnv(l)
case "oneapi":
return oneapiGetVisibleDevicesEnv(l)
default:
slog.Debug("no filter required for library " + l[0].Library)
return "", ""
}
}
func GetSystemInfo() SystemInfo {
gpus := GetGPUInfo()
gpuMutex.Lock()
defer gpuMutex.Unlock()
discoveryErrors := []string{}
for _, err := range bootstrapErrors {
discoveryErrors = append(discoveryErrors, err.Error())
}
if len(gpus) == 1 && gpus[0].Library == "cpu" {
gpus = []GpuInfo{}
}
return SystemInfo{
System: cpus[0],
GPUs: gpus,
UnsupportedGPUs: unsupportedGPUs,
DiscoveryErrors: discoveryErrors,
}
return ""
}


@@ -1,5 +1,3 @@
//go:build darwin
package discover
/*
@@ -11,7 +9,6 @@ import "C"
import (
"log/slog"
"runtime"
"syscall"
"github.com/ollama/ollama/format"
@@ -21,39 +18,6 @@ const (
metalMinimumMemory = 512 * format.MebiByte
)
func GetGPUInfo() GpuInfoList {
mem, _ := GetCPUMem()
if runtime.GOARCH == "amd64" {
return []GpuInfo{
{
Library: "cpu",
memInfo: mem,
},
}
}
info := GpuInfo{
Library: "metal",
ID: "0",
}
info.TotalMemory = uint64(C.getRecommendedMaxVRAM())
// TODO is there a way to gather actual allocated video memory? (currentAllocatedSize doesn't work)
info.FreeMemory = info.TotalMemory
info.MinimumMemory = metalMinimumMemory
return []GpuInfo{info}
}
func GetCPUInfo() GpuInfoList {
mem, _ := GetCPUMem()
return []GpuInfo{
{
Library: "cpu",
memInfo: mem,
},
}
}
func GetCPUMem() (memInfo, error) {
return memInfo{
TotalMemory: uint64(C.getPhysicalMemory()),
@@ -62,13 +26,7 @@ func GetCPUMem() (memInfo, error) {
}, nil
}
func (l GpuInfoList) GetVisibleDevicesEnv() (string, string) {
// No-op on darwin
return "", ""
}
func GetSystemInfo() SystemInfo {
mem, _ := GetCPUMem()
func GetCPUDetails() []CPU {
query := "hw.perflevel0.physicalcpu"
perfCores, err := syscall.SysctlUint32(query)
if err != nil {
@@ -81,19 +39,16 @@ func GetSystemInfo() SystemInfo {
query = "hw.logicalcpu"
logicalCores, _ := syscall.SysctlUint32(query)
return SystemInfo{
System: CPUInfo{
GpuInfo: GpuInfo{
memInfo: mem,
},
CPUs: []CPU{
{
CoreCount: int(perfCores + efficiencyCores),
EfficiencyCoreCount: int(efficiencyCores),
ThreadCount: int(logicalCores),
},
},
return []CPU{
{
CoreCount: int(perfCores + efficiencyCores),
EfficiencyCoreCount: int(efficiencyCores),
ThreadCount: int(logicalCores),
},
GPUs: GetGPUInfo(),
}
}
func IsNUMA() bool {
// numa support in ggml is linux only
return false
}


@@ -1,72 +0,0 @@
#ifndef __APPLE__
#ifndef __GPU_INFO_H__
#define __GPU_INFO_H__
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#ifndef _WIN32
#include <dlfcn.h>
#define LOAD_LIBRARY(lib, flags) dlopen(lib, flags)
#define LOAD_SYMBOL(handle, sym) dlsym(handle, sym)
#define LOAD_ERR() strdup(dlerror())
#define UNLOAD_LIBRARY(handle) dlclose(handle)
#else
#include <windows.h>
#define LOAD_LIBRARY(lib, flags) LoadLibrary(lib)
#define LOAD_SYMBOL(handle, sym) GetProcAddress(handle, sym)
#define UNLOAD_LIBRARY(handle) FreeLibrary(handle)
#define LOAD_ERR() ({\
LPSTR messageBuffer = NULL; \
size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS, \
NULL, GetLastError(), MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&messageBuffer, 0, NULL); \
char *resp = strdup(messageBuffer); \
LocalFree(messageBuffer); \
resp; \
})
#endif
#ifndef LOG
#define LOG(verbose, ...) \
do { \
if (verbose) { \
fprintf(stderr, __VA_ARGS__); \
} \
} while (0)
#endif
#ifdef __cplusplus
extern "C" {
#endif
#define GPU_ID_LEN 64
#define GPU_NAME_LEN 96
typedef struct mem_info {
char *err; // If non-NULL, the caller is responsible for freeing
char gpu_id[GPU_ID_LEN];
char gpu_name[GPU_NAME_LEN];
uint64_t total;
uint64_t free;
uint64_t used;
// Compute Capability
int major;
int minor;
int patch;
} mem_info_t;
void cpu_check_ram(mem_info_t *resp);
#ifdef __cplusplus
}
#endif
#include "gpu_info_cudart.h"
#include "gpu_info_nvcuda.h"
#include "gpu_info_nvml.h"
#include "gpu_info_oneapi.h"
#endif // __GPU_INFO_H__
#endif // __APPLE__
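
The `err` convention in `mem_info_t` (a heap-allocated message that the caller must free) can be sketched roughly as follows. This is a minimal illustration, not code from the removed header: `probe_result_t`, `probe_memory`, and `probe_result_clear` are hypothetical names, and the failure condition is invented purely to exercise the error path.

```c
#define _POSIX_C_SOURCE 200809L  // for strdup under strict ISO C modes
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical, trimmed-down analogue of mem_info_t.
typedef struct probe_result {
  char *err;  // NULL on success; otherwise heap-allocated and owned by the caller
  uint64_t total;
  uint64_t free;
} probe_result_t;

// Hypothetical probe: reports an error when the device claims zero bytes.
void probe_memory(int device, uint64_t total_bytes, uint64_t free_bytes,
                  probe_result_t *resp) {
  char buf[128];
  assert(resp != NULL);
  resp->err = NULL;
  resp->total = 0;
  resp->free = 0;
  if (total_bytes == 0) {
    snprintf(buf, sizeof(buf), "device %d reported no memory", device);
    resp->err = strdup(buf);  // ownership of the message passes to the caller
    return;
  }
  resp->total = total_bytes;
  resp->free = free_bytes;
}

// Caller-side cleanup: release the message and reset the field.
void probe_result_clear(probe_result_t *resp) {
  free(resp->err);  // free(NULL) is a no-op, so this is safe on success too
  resp->err = NULL;
}
```

A caller would check `err` after every probe and call `probe_result_clear` before reusing the struct, so the message is not leaked.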

@@ -1,184 +0,0 @@
#ifndef __APPLE__ // TODO - maybe consider nvidia support on intel macs?
#include <string.h>
#include <inttypes.h>
#include "gpu_info_cudart.h"
void cudart_init(char *cudart_lib_path, cudart_init_resp_t *resp) {
cudartReturn_t ret;
resp->err = NULL;
resp->num_devices = 0;
const int buflen = 256;
char buf[buflen + 1];
int i;
struct lookup {
char *s;
void **p;
} l[] = {
{"cudaSetDevice", (void *)&resp->ch.cudaSetDevice},
{"cudaDeviceSynchronize", (void *)&resp->ch.cudaDeviceSynchronize},
{"cudaDeviceReset", (void *)&resp->ch.cudaDeviceReset},
{"cudaMemGetInfo", (void *)&resp->ch.cudaMemGetInfo},
{"cudaGetDeviceCount", (void *)&resp->ch.cudaGetDeviceCount},
{"cudaDeviceGetAttribute", (void *)&resp->ch.cudaDeviceGetAttribute},
{"cudaDriverGetVersion", (void *)&resp->ch.cudaDriverGetVersion},
{"cudaGetDeviceProperties", (void *)&resp->ch.cudaGetDeviceProperties},
{NULL, NULL},
};
resp->ch.handle = LOAD_LIBRARY(cudart_lib_path, RTLD_LAZY);
if (!resp->ch.handle) {
char *msg = LOAD_ERR();
LOG(resp->ch.verbose, "library %s load err: %s\n", cudart_lib_path, msg);
snprintf(buf, buflen,
"Unable to load %s library to query for Nvidia GPUs: %s",
cudart_lib_path, msg);
free(msg);
resp->err = strdup(buf);
return;
}
for (i = 0; l[i].s != NULL; i++) {
*l[i].p = LOAD_SYMBOL(resp->ch.handle, l[i].s);
if (!*(l[i].p)) {
char *msg = LOAD_ERR();
LOG(resp->ch.verbose, "dlerr: %s\n", msg);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "symbol lookup for %s failed: %s", l[i].s,
msg);
free(msg);
resp->err = strdup(buf);
return;
}
}
ret = (*resp->ch.cudaSetDevice)(0);
if (ret != CUDART_SUCCESS) {
LOG(resp->ch.verbose, "cudaSetDevice err: %d\n", ret);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
if (ret == CUDART_ERROR_INSUFFICIENT_DRIVER) {
resp->err = strdup("your nvidia driver is too old or missing. If you have a CUDA GPU please upgrade to run ollama");
return;
}
snprintf(buf, buflen, "cudart init failure: %d", ret);
resp->err = strdup(buf);
return;
}
int version = 0;
cudartDriverVersion_t driverVersion;
driverVersion.major = 0;
driverVersion.minor = 0;
// Report driver version if we're in verbose mode, ignore errors
ret = (*resp->ch.cudaDriverGetVersion)(&version);
if (ret != CUDART_SUCCESS) {
LOG(resp->ch.verbose, "cudaDriverGetVersion failed: %d\n", ret);
} else {
driverVersion.major = version / 1000;
driverVersion.minor = (version - (driverVersion.major * 1000)) / 10;
LOG(resp->ch.verbose, "CUDA driver version: %d-%d\n", driverVersion.major, driverVersion.minor);
}
ret = (*resp->ch.cudaGetDeviceCount)(&resp->num_devices);
if (ret != CUDART_SUCCESS) {
LOG(resp->ch.verbose, "cudaGetDeviceCount err: %d\n", ret);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "unable to get device count: %d", ret);
resp->err = strdup(buf);
return;
}
}
void cudart_bootstrap(cudart_handle_t h, int i, mem_info_t *resp) {
resp->err = NULL;
cudartMemory_t memInfo = {0,0,0};
cudartReturn_t ret;
const int buflen = 256;
char buf[buflen + 1];
if (h.handle == NULL) {
resp->err = strdup("cudart handle isn't initialized");
return;
}
ret = (*h.cudaSetDevice)(i);
if (ret != CUDART_SUCCESS) {
snprintf(buf, buflen, "cudart device failed to initialize");
resp->err = strdup(buf);
return;
}
cudaDeviceProp_t props;
ret = (*h.cudaGetDeviceProperties)(&props, i);
if (ret != CUDART_SUCCESS) {
LOG(h.verbose, "[%d] device properties lookup failure: %d\n", i, ret);
snprintf(&resp->gpu_id[0], GPU_ID_LEN, "%d", i);
resp->major = 0;
resp->minor = 0;
} else {
int allNull = 1;
for (int j = 0; j < 16; j++) {
if (props.uuid.bytes[j] != 0) {
allNull = 0;
break;
}
}
if (allNull != 0) {
snprintf(&resp->gpu_id[0], GPU_ID_LEN, "%d", i);
} else {
// GPU-d110a105-ac29-1d54-7b49-9c90440f215b
snprintf(&resp->gpu_id[0], GPU_ID_LEN,
"GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
props.uuid.bytes[0],
props.uuid.bytes[1],
props.uuid.bytes[2],
props.uuid.bytes[3],
props.uuid.bytes[4],
props.uuid.bytes[5],
props.uuid.bytes[6],
props.uuid.bytes[7],
props.uuid.bytes[8],
props.uuid.bytes[9],
props.uuid.bytes[10],
props.uuid.bytes[11],
props.uuid.bytes[12],
props.uuid.bytes[13],
props.uuid.bytes[14],
props.uuid.bytes[15]
);
}
resp->major = props.major;
resp->minor = props.minor;
// TODO add other useful properties from props
}
ret = (*h.cudaMemGetInfo)(&memInfo.free, &memInfo.total);
if (ret != CUDART_SUCCESS) {
snprintf(buf, buflen, "cudart device memory info lookup failure %d", ret);
resp->err = strdup(buf);
return;
}
resp->total = memInfo.total;
resp->free = memInfo.free;
// cudaMemGetInfo reports only free and total, so derive used from them
resp->used = memInfo.total - memInfo.free;
LOG(h.verbose, "[%s] CUDA totalMem %" PRId64 "\n", resp->gpu_id, resp->total);
LOG(h.verbose, "[%s] CUDA freeMem %" PRId64 "\n", resp->gpu_id, resp->free);
LOG(h.verbose, "[%s] CUDA usedMem %" PRId64 "\n", resp->gpu_id, resp->used);
LOG(h.verbose, "[%s] Compute Capability %d.%d\n", resp->gpu_id, resp->major, resp->minor);
}
void cudart_release(cudart_handle_t h) {
LOG(h.verbose, "releasing cudart library\n");
UNLOAD_LIBRARY(h.handle);
h.handle = NULL;
}
#endif // __APPLE__
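
The device ID logic in `cudart_bootstrap` (format the 16-byte UUID in the `GPU-8-4-4-4-12` layout, or fall back to the device index when the UUID is all zeros) can be sketched in a more compact form. `format_gpu_id` is a hypothetical helper written for illustration. It is assumed, not confirmed by the source, that a loop is an acceptable substitute for the explicit 16-argument `snprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define GPU_ID_LEN 64

// Hypothetical helper: writes either "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
// or, when the UUID is all zeros, the plain device index.
void format_gpu_id(const unsigned char uuid[16], int index, char out[GPU_ID_LEN]) {
  static const unsigned char zero[16] = {0};
  if (memcmp(uuid, zero, sizeof(zero)) == 0) {
    snprintf(out, GPU_ID_LEN, "%d", index);
    return;
  }
  size_t pos = (size_t)snprintf(out, GPU_ID_LEN, "GPU-");
  for (int j = 0; j < 16; j++) {
    // Dashes precede bytes 4, 6, 8 and 10, giving the 8-4-4-4-12 grouping.
    if (j == 4 || j == 6 || j == 8 || j == 10) {
      out[pos++] = '-';
    }
    pos += (size_t)snprintf(out + pos, GPU_ID_LEN - pos, "%02x", (unsigned)uuid[j]);
  }
  assert(pos == 40);  // 4 prefix chars + 32 hex digits + 4 dashes
}
```

The resulting string is 40 characters long, so it fits comfortably within `GPU_ID_LEN`.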

@@ -1,148 +0,0 @@
#ifndef __APPLE__
#ifndef __GPU_INFO_CUDART_H__
#define __GPU_INFO_CUDART_H__
#include "gpu_info.h"
// Just enough typedefs to dlopen/dlsym for memory information
typedef enum cudartReturn_enum {
CUDART_SUCCESS = 0,
CUDART_ERROR_INVALID_VALUE = 1,
CUDART_ERROR_MEMORY_ALLOCATION = 2,
CUDART_ERROR_INSUFFICIENT_DRIVER = 35,
// Other values omitted for now...
} cudartReturn_t;
typedef enum cudartDeviceAttr_enum {
cudartDevAttrComputeCapabilityMajor = 75,
cudartDevAttrComputeCapabilityMinor = 76,
// TODO - not yet wired up but may be useful for Jetson or other
// integrated GPU scenarios with shared memory
cudaDevAttrIntegrated = 18
} cudartDeviceAttr_t;
typedef void *cudartDevice_t; // Opaque is sufficient
typedef struct cudartMemory_st {
size_t total;
size_t free;
size_t used;
} cudartMemory_t;
typedef struct cudartDriverVersion {
int major;
int minor;
} cudartDriverVersion_t;
typedef struct cudaUUID {
unsigned char bytes[16];
} cudaUUID_t;
typedef struct cudaDeviceProp {
char name[256]; /**< ASCII string identifying device */
cudaUUID_t uuid; /**< 16-byte unique identifier */
char luid[8]; /**< 8-byte locally unique identifier. Value is undefined on TCC and non-Windows platforms */
unsigned int luidDeviceNodeMask; /**< LUID device node mask. Value is undefined on TCC and non-Windows platforms */
size_t totalGlobalMem; /**< Global memory available on device in bytes */
size_t sharedMemPerBlock; /**< Shared memory available per block in bytes */
int regsPerBlock; /**< 32-bit registers available per block */
int warpSize; /**< Warp size in threads */
size_t memPitch; /**< Maximum pitch in bytes allowed by memory copies */
int maxThreadsPerBlock; /**< Maximum number of threads per block */
int maxThreadsDim[3]; /**< Maximum size of each dimension of a block */
int maxGridSize[3]; /**< Maximum size of each dimension of a grid */
int clockRate; /**< Clock frequency in kilohertz */
size_t totalConstMem; /**< Constant memory available on device in bytes */
int major; /**< Major compute capability */
int minor; /**< Minor compute capability */
size_t textureAlignment; /**< Alignment requirement for textures */
size_t texturePitchAlignment; /**< Pitch alignment requirement for texture references bound to pitched memory */
int deviceOverlap; /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */
int multiProcessorCount; /**< Number of multiprocessors on device */
int kernelExecTimeoutEnabled; /**< Specified whether there is a run time limit on kernels */
int integrated; /**< Device is integrated as opposed to discrete */
int canMapHostMemory; /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */
int computeMode; /**< Compute mode (See ::cudaComputeMode) */
int maxTexture1D; /**< Maximum 1D texture size */
int maxTexture1DMipmap; /**< Maximum 1D mipmapped texture size */
int maxTexture1DLinear; /**< Deprecated, do not use. Use cudaDeviceGetTexture1DLinearMaxWidth() or cuDeviceGetTexture1DLinearMaxWidth() instead. */
int maxTexture2D[2]; /**< Maximum 2D texture dimensions */
int maxTexture2DMipmap[2]; /**< Maximum 2D mipmapped texture dimensions */
int maxTexture2DLinear[3]; /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */
int maxTexture2DGather[2]; /**< Maximum 2D texture dimensions if texture gather operations have to be performed */
int maxTexture3D[3]; /**< Maximum 3D texture dimensions */
int maxTexture3DAlt[3]; /**< Maximum alternate 3D texture dimensions */
int maxTextureCubemap; /**< Maximum Cubemap texture dimensions */
int maxTexture1DLayered[2]; /**< Maximum 1D layered texture dimensions */
int maxTexture2DLayered[3]; /**< Maximum 2D layered texture dimensions */
int maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */
int maxSurface1D; /**< Maximum 1D surface size */
int maxSurface2D[2]; /**< Maximum 2D surface dimensions */
int maxSurface3D[3]; /**< Maximum 3D surface dimensions */
int maxSurface1DLayered[2]; /**< Maximum 1D layered surface dimensions */
int maxSurface2DLayered[3]; /**< Maximum 2D layered surface dimensions */
int maxSurfaceCubemap; /**< Maximum Cubemap surface dimensions */
int maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */
size_t surfaceAlignment; /**< Alignment requirements for surfaces */
int concurrentKernels; /**< Device can possibly execute multiple kernels concurrently */
int ECCEnabled; /**< Device has ECC support enabled */
int pciBusID; /**< PCI bus ID of the device */
int pciDeviceID; /**< PCI device ID of the device */
int pciDomainID; /**< PCI domain ID of the device */
int tccDriver; /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */
int asyncEngineCount; /**< Number of asynchronous engines */
int unifiedAddressing; /**< Device shares a unified address space with the host */
int memoryClockRate; /**< Peak memory clock frequency in kilohertz */
int memoryBusWidth; /**< Global memory bus width in bits */
int l2CacheSize; /**< Size of L2 cache in bytes */
int persistingL2CacheMaxSize; /**< Device's maximum l2 persisting lines capacity setting in bytes */
int maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */
int streamPrioritiesSupported; /**< Device supports stream priorities */
int globalL1CacheSupported; /**< Device supports caching globals in L1 */
int localL1CacheSupported; /**< Device supports caching locals in L1 */
size_t sharedMemPerMultiprocessor; /**< Shared memory available per multiprocessor in bytes */
int regsPerMultiprocessor; /**< 32-bit registers available per multiprocessor */
int managedMemory; /**< Device supports allocating managed memory on this system */
int isMultiGpuBoard; /**< Device is on a multi-GPU board */
int multiGpuBoardGroupID; /**< Unique identifier for a group of devices on the same multi-GPU board */
int hostNativeAtomicSupported; /**< Link between the device and the host supports native atomic operations */
int singleToDoublePrecisionPerfRatio; /**< Ratio of single precision performance (in floating-point operations per second) to double precision performance */
int pageableMemoryAccess; /**< Device supports coherently accessing pageable memory without calling cudaHostRegister on it */
int concurrentManagedAccess; /**< Device can coherently access managed memory concurrently with the CPU */
int computePreemptionSupported; /**< Device supports Compute Preemption */
int canUseHostPointerForRegisteredMem; /**< Device can access host registered memory at the same virtual address as the CPU */
int cooperativeLaunch; /**< Device supports launching cooperative kernels via ::cudaLaunchCooperativeKernel */
int cooperativeMultiDeviceLaunch; /**< Deprecated, cudaLaunchCooperativeKernelMultiDevice is deprecated. */
size_t sharedMemPerBlockOptin; /**< Per device maximum shared memory per block usable by special opt in */
int pageableMemoryAccessUsesHostPageTables; /**< Device accesses pageable memory via the host's page tables */
int directManagedMemAccessFromHost; /**< Host can directly access managed memory on the device without migration. */
int maxBlocksPerMultiProcessor; /**< Maximum number of resident blocks per multiprocessor */
int accessPolicyMaxWindowSize; /**< The maximum value of ::cudaAccessPolicyWindow::num_bytes. */
size_t reservedSharedMemPerBlock; /**< Shared memory reserved by CUDA driver per block in bytes */
} cudaDeviceProp_t;
typedef struct cudart_handle {
void *handle;
uint16_t verbose;
cudartReturn_t (*cudaSetDevice)(int device);
cudartReturn_t (*cudaDeviceSynchronize)(void);
cudartReturn_t (*cudaDeviceReset)(void);
cudartReturn_t (*cudaMemGetInfo)(size_t *, size_t *);
cudartReturn_t (*cudaGetDeviceCount)(int *);
cudartReturn_t (*cudaDeviceGetAttribute)(int* value, cudartDeviceAttr_t attr, int device);
cudartReturn_t (*cudaDriverGetVersion) (int *driverVersion);
cudartReturn_t (*cudaGetDeviceProperties) (cudaDeviceProp_t* prop, int device);
} cudart_handle_t;
typedef struct cudart_init_resp {
char *err; // If err is non-null handle is invalid
cudart_handle_t ch;
int num_devices;
} cudart_init_resp_t;
void cudart_init(char *cudart_lib_path, cudart_init_resp_t *resp);
void cudart_bootstrap(cudart_handle_t ch, int device_id, mem_info_t *resp);
// TODO - if we keep this library longer term, add cudart_get_free
void cudart_release(cudart_handle_t ch);
#endif // __GPU_INFO_CUDART_H__
#endif // __APPLE__

@@ -1,251 +0,0 @@
#ifndef __APPLE__ // TODO - maybe consider nvidia support on intel macs?
#include <string.h>
#include <inttypes.h>
#include "gpu_info_nvcuda.h"
void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
LOG(resp->ch.verbose, "initializing %s\n", nvcuda_lib_path);
CUresult ret;
resp->err = NULL;
resp->num_devices = 0;
resp->cudaErr = CUDA_SUCCESS;
const int buflen = 256;
char buf[buflen + 1];
int i;
struct lookup {
char *s;
void **p;
} l[] = {
{"cuInit", (void *)&resp->ch.cuInit},
{"cuDriverGetVersion", (void *)&resp->ch.cuDriverGetVersion},
{"cuDeviceGetCount", (void *)&resp->ch.cuDeviceGetCount},
{"cuDeviceGet", (void *)&resp->ch.cuDeviceGet},
{"cuDeviceGetAttribute", (void *)&resp->ch.cuDeviceGetAttribute},
{"cuDeviceGetUuid", (void *)&resp->ch.cuDeviceGetUuid},
{"cuDeviceGetName", (void *)&resp->ch.cuDeviceGetName},
{"cuCtxCreate_v3", (void *)&resp->ch.cuCtxCreate_v3},
{"cuMemGetInfo_v2", (void *)&resp->ch.cuMemGetInfo_v2},
{"cuCtxDestroy", (void *)&resp->ch.cuCtxDestroy},
{NULL, NULL},
};
resp->ch.handle = LOAD_LIBRARY(nvcuda_lib_path, RTLD_LAZY);
if (!resp->ch.handle) {
char *msg = LOAD_ERR();
LOG(resp->ch.verbose, "library %s load err: %s\n", nvcuda_lib_path, msg);
snprintf(buf, buflen,
"Unable to load %s library to query for Nvidia GPUs: %s",
nvcuda_lib_path, msg);
free(msg);
resp->err = strdup(buf);
resp->cudaErr = -1;
return;
}
for (i = 0; l[i].s != NULL; i++) {
*l[i].p = LOAD_SYMBOL(resp->ch.handle, l[i].s);
if (!*(l[i].p)) {
char *msg = LOAD_ERR();
LOG(resp->ch.verbose, "dlerr: %s\n", msg);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "symbol lookup for %s failed: %s", l[i].s,
msg);
free(msg);
resp->err = strdup(buf);
resp->cudaErr = -1;
return;
}
LOG(resp->ch.verbose, "dlsym: %s - %p\n", l[i].s, *l[i].p);
}
LOG(resp->ch.verbose, "calling cuInit\n");
ret = (*resp->ch.cuInit)(0);
if (ret != CUDA_SUCCESS) {
LOG(resp->ch.verbose, "cuInit err: %d\n", ret);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "cuda driver library init failure: %d", ret);
resp->err = strdup(buf);
resp->cudaErr = ret;
return;
}
int version = 0;
resp->ch.driver_major = 0;
resp->ch.driver_minor = 0;
// Report driver version if we're in verbose mode, ignore errors
LOG(resp->ch.verbose, "calling cuDriverGetVersion\n");
ret = (*resp->ch.cuDriverGetVersion)(&version);
if (ret != CUDA_SUCCESS) {
LOG(resp->ch.verbose, "cuDriverGetVersion failed: %d\n", ret);
} else {
LOG(resp->ch.verbose, "raw version 0x%x\n", version);
resp->ch.driver_major = version / 1000;
resp->ch.driver_minor = (version - (resp->ch.driver_major * 1000)) / 10;
LOG(resp->ch.verbose, "CUDA driver version: %d.%d\n", resp->ch.driver_major, resp->ch.driver_minor);
}
LOG(resp->ch.verbose, "calling cuDeviceGetCount\n");
ret = (*resp->ch.cuDeviceGetCount)(&resp->num_devices);
if (ret != CUDA_SUCCESS) {
LOG(resp->ch.verbose, "cuDeviceGetCount err: %d\n", ret);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "unable to get device count: %d", ret);
resp->err = strdup(buf);
resp->cudaErr = ret;
return;
}
LOG(resp->ch.verbose, "device count %d\n", resp->num_devices);
}
const int buflen = 256;
void nvcuda_bootstrap(nvcuda_handle_t h, int i, mem_info_t *resp) {
resp->err = NULL;
nvcudaMemory_t memInfo = {0,0};
CUresult ret;
CUdevice device = -1;
CUcontext ctx = NULL;
char buf[buflen + 1];
CUuuid uuid = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
if (h.handle == NULL) {
resp->err = strdup("cuda driver library handle isn't initialized");
return;
}
ret = (*h.cuDeviceGet)(&device, i);
if (ret != CUDA_SUCCESS) {
snprintf(buf, buflen, "cuda driver library device failed to initialize");
resp->err = strdup(buf);
return;
}
int major = 0;
int minor = 0;
ret = (*h.cuDeviceGetAttribute)(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device);
if (ret != CUDA_SUCCESS) {
LOG(h.verbose, "[%d] device major lookup failure: %d\n", i, ret);
} else {
ret = (*h.cuDeviceGetAttribute)(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device);
if (ret != CUDA_SUCCESS) {
LOG(h.verbose, "[%d] device minor lookup failure: %d\n", i, ret);
} else {
resp->minor = minor;
resp->major = major;
}
}
ret = (*h.cuDeviceGetUuid)(&uuid, device);
if (ret != CUDA_SUCCESS) {
LOG(h.verbose, "[%d] device uuid lookup failure: %d\n", i, ret);
snprintf(&resp->gpu_id[0], GPU_ID_LEN, "%d", i);
} else {
// GPU-d110a105-ac29-1d54-7b49-9c90440f215b
snprintf(&resp->gpu_id[0], GPU_ID_LEN,
"GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
uuid.bytes[0],
uuid.bytes[1],
uuid.bytes[2],
uuid.bytes[3],
uuid.bytes[4],
uuid.bytes[5],
uuid.bytes[6],
uuid.bytes[7],
uuid.bytes[8],
uuid.bytes[9],
uuid.bytes[10],
uuid.bytes[11],
uuid.bytes[12],
uuid.bytes[13],
uuid.bytes[14],
uuid.bytes[15]
);
}
ret = (*h.cuDeviceGetName)(&resp->gpu_name[0], GPU_NAME_LEN, device);
if (ret != CUDA_SUCCESS) {
LOG(h.verbose, "[%d] device name lookup failure: %d\n", i, ret);
resp->gpu_name[0] = '\0';
}
// To get memory we have to set (and release) a context
ret = (*h.cuCtxCreate_v3)(&ctx, NULL, 0, 0, device);
if (ret != CUDA_SUCCESS) {
snprintf(buf, buflen, "cuda driver library failed to get device context %d", ret);
resp->err = strdup(buf);
return;
}
ret = (*h.cuMemGetInfo_v2)(&memInfo.free, &memInfo.total);
if (ret != CUDA_SUCCESS) {
snprintf(buf, buflen, "cuda driver library device memory info lookup failure %d", ret);
resp->err = strdup(buf);
// Best effort on failure...
(*h.cuCtxDestroy)(ctx);
return;
}
resp->total = memInfo.total;
resp->free = memInfo.free;
LOG(h.verbose, "[%s] CUDA totalMem %" PRId64 "mb\n", resp->gpu_id, resp->total / 1024 / 1024);
LOG(h.verbose, "[%s] CUDA freeMem %" PRId64 "mb\n", resp->gpu_id, resp->free / 1024 / 1024);
LOG(h.verbose, "[%s] Compute Capability %d.%d\n", resp->gpu_id, resp->major, resp->minor);
ret = (*h.cuCtxDestroy)(ctx);
if (ret != CUDA_SUCCESS) {
LOG(1, "cuda driver library failed to release device context %d", ret);
}
}
void nvcuda_get_free(nvcuda_handle_t h, int i, uint64_t *free, uint64_t *total) {
CUresult ret;
CUcontext ctx = NULL;
CUdevice device = -1;
*free = 0;
*total = 0;
ret = (*h.cuDeviceGet)(&device, i);
if (ret != CUDA_SUCCESS) {
LOG(1, "cuda driver library device failed to initialize");
return;
}
// To get memory we have to set (and release) a context
ret = (*h.cuCtxCreate_v3)(&ctx, NULL, 0, 0, device);
if (ret != CUDA_SUCCESS) {
LOG(1, "cuda driver library failed to get device context %d", ret);
return;
}
ret = (*h.cuMemGetInfo_v2)(free, total);
if (ret != CUDA_SUCCESS) {
LOG(1, "cuda driver library device memory info lookup failure %d", ret);
// Best effort on failure...
(*h.cuCtxDestroy)(ctx);
return;
}
ret = (*h.cuCtxDestroy)(ctx);
if (ret != CUDA_SUCCESS) {
LOG(1, "cuda driver library failed to release device context %d", ret);
}
}
void nvcuda_release(nvcuda_handle_t h) {
LOG(h.verbose, "releasing cuda driver library\n");
UNLOAD_LIBRARY(h.handle);
// TODO and other context release logic?
h.handle = NULL;
}
#endif // __APPLE__
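
Both `nvcuda_init` and `cudart_init` decode the driver version the same way: CUDA reports it as a single integer, `1000 * major + 10 * minor`. A minimal sketch of that arithmetic, using a hypothetical `decode_driver_version` helper and struct that do not appear in the removed files:

```c
#include <assert.h>

// Hypothetical result type for illustration.
typedef struct driver_version {
  int major;
  int minor;
} driver_version_t;

// Split the packed integer (1000 * major + 10 * minor) into its parts.
driver_version_t decode_driver_version(int raw) {
  assert(raw >= 0);
  driver_version_t v;
  v.major = raw / 1000;
  v.minor = (raw % 1000) / 10;  // same as (raw - major * 1000) / 10
  return v;
}
```

For example, a raw value of 12040 decodes to major 12 and minor 4.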

@@ -1,79 +0,0 @@
#ifndef __APPLE__
#ifndef __GPU_INFO_NVCUDA_H__
#define __GPU_INFO_NVCUDA_H__
#include "gpu_info.h"
// Just enough typedefs to dlopen/dlsym for memory information
typedef enum cudaError_enum {
CUDA_SUCCESS = 0,
CUDA_ERROR_INVALID_VALUE = 1,
CUDA_ERROR_OUT_OF_MEMORY = 2,
CUDA_ERROR_NOT_INITIALIZED = 3,
CUDA_ERROR_INSUFFICIENT_DRIVER = 35,
CUDA_ERROR_NO_DEVICE = 100,
CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803,
CUDA_ERROR_UNKNOWN = 999,
// Other values omitted for now...
} CUresult;
typedef enum CUdevice_attribute_enum {
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,
// TODO - not yet wired up but may be useful for Jetson or other
// integrated GPU scenarios with shared memory
CU_DEVICE_ATTRIBUTE_INTEGRATED = 18
} CUdevice_attribute;
typedef void *nvcudaDevice_t; // Opaque is sufficient
typedef struct nvcudaMemory_st {
uint64_t total;
uint64_t free;
} nvcudaMemory_t;
typedef struct nvcudaDriverVersion {
int major;
int minor;
} nvcudaDriverVersion_t;
typedef struct CUuuid_st {
unsigned char bytes[16];
} CUuuid;
typedef int CUdevice;
typedef void* CUcontext;
typedef struct nvcuda_handle {
void *handle;
uint16_t verbose;
int driver_major;
int driver_minor;
CUresult (*cuInit)(unsigned int Flags);
CUresult (*cuDriverGetVersion)(int *driverVersion);
CUresult (*cuDeviceGetCount)(int *);
CUresult (*cuDeviceGet)(CUdevice* device, int ordinal);
CUresult (*cuDeviceGetAttribute)(int* pi, CUdevice_attribute attrib, CUdevice dev);
CUresult (*cuDeviceGetUuid)(CUuuid* uuid, CUdevice dev); // signature compatible with cuDeviceGetUuid_v2
CUresult (*cuDeviceGetName)(char *name, int len, CUdevice dev);
// Context specific aspects
CUresult (*cuCtxCreate_v3)(CUcontext* pctx, void *params, int len, unsigned int flags, CUdevice dev);
CUresult (*cuMemGetInfo_v2)(uint64_t* free, uint64_t* total);
CUresult (*cuCtxDestroy)(CUcontext ctx);
} nvcuda_handle_t;
typedef struct nvcuda_init_resp {
char *err; // If err is non-null handle is invalid
nvcuda_handle_t ch;
int num_devices;
CUresult cudaErr;
} nvcuda_init_resp_t;
void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp);
void nvcuda_bootstrap(nvcuda_handle_t ch, int device_id, mem_info_t *resp);
void nvcuda_get_free(nvcuda_handle_t ch, int device_id, uint64_t *free, uint64_t *total);
void nvcuda_release(nvcuda_handle_t ch);
#endif // __GPU_INFO_NVCUDA_H__
#endif // __APPLE__
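
The `*_release` functions take their handle struct by value, so the `h.handle = NULL` assignment at the end of each one only clears a local copy. The sketch below contrasts that with a pointer-taking variant. `lib_handle_t`, `release_by_value`, and `release_by_pointer` are hypothetical names, and the real unload call is omitted.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for nvcuda_handle_t and friends.
typedef struct lib_handle {
  void *handle;
} lib_handle_t;

// Mirrors the by-value signature: only the callee's copy is cleared.
void release_by_value(lib_handle_t h) {
  h.handle = NULL;
  (void)h;  // the caller's struct is untouched
}

// Pointer variant: the caller's struct is cleared as well.
void release_by_pointer(lib_handle_t *h) {
  assert(h != NULL);
  h->handle = NULL;
}
```

With the by-value form, the caller still holds the stale pointer after release and has to reset its own copy to avoid reusing it.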

@@ -1,104 +0,0 @@
#ifndef __APPLE__ // TODO - maybe consider nvidia support on intel macs?
#include <string.h>
#include "gpu_info_nvml.h"
void nvml_init(char *nvml_lib_path, nvml_init_resp_t *resp) {
nvmlReturn_t ret;
resp->err = NULL;
const int buflen = 256;
char buf[buflen + 1];
int i;
struct lookup {
char *s;
void **p;
} l[] = {
{"nvmlInit_v2", (void *)&resp->ch.nvmlInit_v2},
{"nvmlShutdown", (void *)&resp->ch.nvmlShutdown},
{"nvmlDeviceGetHandleByUUID", (void *)&resp->ch.nvmlDeviceGetHandleByUUID},
{"nvmlDeviceGetMemoryInfo", (void *)&resp->ch.nvmlDeviceGetMemoryInfo},
{NULL, NULL},
};
resp->ch.handle = LOAD_LIBRARY(nvml_lib_path, RTLD_LAZY);
if (!resp->ch.handle) {
char *msg = LOAD_ERR();
LOG(resp->ch.verbose, "library %s load err: %s\n", nvml_lib_path, msg);
snprintf(buf, buflen,
"Unable to load %s library to query for Nvidia GPUs: %s",
nvml_lib_path, msg);
free(msg);
resp->err = strdup(buf);
return;
}
// TODO once we've squashed the remaining corner cases remove this log
// LOG(resp->ch.verbose, "wiring nvidia management library functions in %s\n", nvml_lib_path);
for (i = 0; l[i].s != NULL; i++) {
// TODO once we've squashed the remaining corner cases remove this log
// LOG(resp->ch.verbose, "dlsym: %s\n", l[i].s);
*l[i].p = LOAD_SYMBOL(resp->ch.handle, l[i].s);
if (!*(l[i].p)) {
char *msg = LOAD_ERR();
LOG(resp->ch.verbose, "dlerr: %s\n", msg);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "symbol lookup for %s failed: %s", l[i].s,
msg);
free(msg);
resp->err = strdup(buf);
return;
}
}
ret = (*resp->ch.nvmlInit_v2)();
if (ret != NVML_SUCCESS) {
LOG(resp->ch.verbose, "nvmlInit_v2 err: %d\n", ret);
UNLOAD_LIBRARY(resp->ch.handle);
resp->ch.handle = NULL;
snprintf(buf, buflen, "nvml vram init failure: %d", ret);
resp->err = strdup(buf);
return;
}
}
void nvml_get_free(nvml_handle_t h, char *uuid, uint64_t *free, uint64_t *total, uint64_t *used) {
nvmlDevice_t device;
nvmlMemory_t memInfo = {0};
nvmlReturn_t ret;
ret = (*h.nvmlDeviceGetHandleByUUID)((const char *)(uuid), &device);
if (ret != NVML_SUCCESS) {
LOG(1, "unable to get device handle %s: %d", uuid, ret);
*free = 0;
return;
}
ret = (*h.nvmlDeviceGetMemoryInfo)(device, &memInfo);
if (ret != NVML_SUCCESS) {
LOG(1, "device memory info lookup failure %s: %d", uuid, ret);
*free = 0;
return;
}
*free = memInfo.free;
*total = memInfo.total;
*used = memInfo.used;
}
void nvml_release(nvml_handle_t h) {
LOG(h.verbose, "releasing nvml library\n");
nvmlReturn_t ret;
ret = (*h.nvmlShutdown)();
if (ret != NVML_SUCCESS) {
LOG(1, "error during nvmlShutdown %d", ret);
}
UNLOAD_LIBRARY(h.handle);
h.handle = NULL;
}
#endif // __APPLE__
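
All of the `*_init` functions share one wiring pattern: a NULL-terminated table pairs each symbol name with the address of a function-pointer slot, and the first missing symbol aborts the load. The sketch below shows that pattern with stand-ins for `dlsym` and `dlclose`, so it runs without any GPU library. Every name in it (`demo_handle_t`, `wire_symbols`, the resolvers, `fake_unload`) is hypothetical. As in the original code, it relies on the POSIX-style conversion between function pointers and `void *`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical handle with two function-pointer slots to fill in.
typedef struct demo_handle {
  void *handle;
  int (*init)(void);
  int (*shutdown)(void);
} demo_handle_t;

int demo_init(void) { return 1; }
int demo_shutdown(void) { return 2; }

// Stand-in for dlsym/GetProcAddress that knows every symbol.
void *resolve_all(void *lib, const char *name) {
  (void)lib;
  if (strcmp(name, "init") == 0) return (void *)demo_init;
  if (strcmp(name, "shutdown") == 0) return (void *)demo_shutdown;
  return NULL;
}

// Stand-in resolver that is missing "shutdown", to exercise the failure path.
void *resolve_init_only(void *lib, const char *name) {
  (void)lib;
  return strcmp(name, "init") == 0 ? (void *)demo_init : NULL;
}

int unloaded_count = 0;

// Stand-in for dlclose/FreeLibrary: counts releases of a valid handle.
void fake_unload(void *lib) {
  if (lib != NULL) unloaded_count++;
}

typedef void *(*resolver_fn)(void *lib, const char *name);

// Returns 0 on success, -1 if any symbol is missing.
int wire_symbols(demo_handle_t *h, void *lib, resolver_fn resolve) {
  assert(h != NULL && resolve != NULL);
  struct lookup {
    const char *s;
    void **p;
  } l[] = {
      {"init", (void **)&h->init},
      {"shutdown", (void **)&h->shutdown},
      {NULL, NULL},
  };
  h->handle = lib;
  for (int i = 0; l[i].s != NULL; i++) {
    *l[i].p = resolve(h->handle, l[i].s);
    if (*l[i].p == NULL) {
      fake_unload(h->handle);  // release while the handle is still valid
      h->handle = NULL;        // then mark the handle unusable
      return -1;
    }
  }
  return 0;
}
```

The order in the failure branch matters: clearing the handle before unloading would pass NULL to the unload call and leave the library loaded.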

@@ -1,48 +0,0 @@
#ifndef __APPLE__
#ifndef __GPU_INFO_NVML_H__
#define __GPU_INFO_NVML_H__
#include "gpu_info.h"
// Just enough typedefs to dlopen/dlsym for memory information
typedef enum nvmlReturn_enum {
NVML_SUCCESS = 0,
// Other values omitted for now...
} nvmlReturn_t;
typedef void *nvmlDevice_t; // Opaque is sufficient
typedef struct nvmlMemory_st {
unsigned long long total;
unsigned long long free;
unsigned long long used;
} nvmlMemory_t;
typedef enum nvmlBrandType_enum
{
NVML_BRAND_UNKNOWN = 0,
} nvmlBrandType_t;
typedef struct nvml_handle {
void *handle;
uint16_t verbose;
nvmlReturn_t (*nvmlInit_v2)(void);
nvmlReturn_t (*nvmlShutdown)(void);
nvmlReturn_t (*nvmlDeviceGetHandleByUUID)(const char *, nvmlDevice_t *);
nvmlReturn_t (*nvmlDeviceGetMemoryInfo)(nvmlDevice_t, nvmlMemory_t *);
} nvml_handle_t;
typedef struct nvml_init_resp {
char *err; // If err is non-null handle is invalid
nvml_handle_t ch;
} nvml_init_resp_t;
typedef struct nvml_compute_capability {
char *err;
int major;
int minor;
} nvml_compute_capability_t;
void nvml_init(char *nvml_lib_path, nvml_init_resp_t *resp);
void nvml_get_free(nvml_handle_t ch, char *uuid, uint64_t *free, uint64_t *total, uint64_t *used);
void nvml_release(nvml_handle_t ch);
#endif // __GPU_INFO_NVML_H__
#endif // __APPLE__

@@ -1,259 +0,0 @@
#ifndef __APPLE__
#include "gpu_info_oneapi.h"
#include <string.h>
void oneapi_init(char *oneapi_lib_path, oneapi_init_resp_t *resp) {
ze_result_t ret;
resp->err = NULL;
resp->oh.devices = NULL;
resp->oh.num_devices = NULL;
resp->oh.drivers = NULL;
resp->oh.num_drivers = 0;
const int buflen = 256;
char buf[buflen + 1];
int i, d;
struct lookup {
char *s;
void **p;
} l[] = {
{"zesInit", (void *)&resp->oh.zesInit},
{"zesDriverGet", (void *)&resp->oh.zesDriverGet},
{"zesDeviceGet", (void *)&resp->oh.zesDeviceGet},
{"zesDeviceGetProperties", (void *)&resp->oh.zesDeviceGetProperties},
{"zesDeviceEnumMemoryModules",
(void *)&resp->oh.zesDeviceEnumMemoryModules},
{"zesMemoryGetProperties", (void *)&resp->oh.zesMemoryGetProperties},
{"zesMemoryGetState", (void *)&resp->oh.zesMemoryGetState},
{NULL, NULL},
};
resp->oh.handle = LOAD_LIBRARY(oneapi_lib_path, RTLD_LAZY);
if (!resp->oh.handle) {
char *msg = LOAD_ERR();
snprintf(buf, buflen,
"Unable to load %s library to query for Intel GPUs: %s\n",
oneapi_lib_path, msg);
free(msg);
resp->err = strdup(buf);
return;
}
// TODO once we've squashed the remaining corner cases remove this log
LOG(resp->oh.verbose,
"wiring Level-Zero management library functions in %s\n",
oneapi_lib_path);
for (i = 0; l[i].s != NULL; i++) {
// TODO once we've squashed the remaining corner cases remove this log
LOG(resp->oh.verbose, "dlsym: %s\n", l[i].s);
*l[i].p = LOAD_SYMBOL(resp->oh.handle, l[i].s);
if (!*(l[i].p)) {
char *msg = LOAD_ERR();
LOG(resp->oh.verbose, "dlerr: %s\n", msg);
UNLOAD_LIBRARY(resp->oh.handle);
resp->oh.handle = NULL;
snprintf(buf, buflen, "symbol lookup for %s failed: %s", l[i].s, msg);
free(msg);
resp->err = strdup(buf);
return;
}
}
LOG(resp->oh.verbose, "calling zesInit\n");
ret = (*resp->oh.zesInit)(0);
if (ret != ZE_RESULT_SUCCESS) {
LOG(resp->oh.verbose, "zesInit err: %x\n", ret);
snprintf(buf, buflen, "oneapi vram init failure: %x", ret);
resp->err = strdup(buf);
oneapi_release(resp->oh);
return;
}
LOG(resp->oh.verbose, "calling zesDriverGet\n");
ret = (*resp->oh.zesDriverGet)(&resp->oh.num_drivers, NULL);
if (ret != ZE_RESULT_SUCCESS) {
LOG(resp->oh.verbose, "zesDriverGet err: %x\n", ret);
snprintf(buf, buflen, "unable to get driver count: %x", ret);
resp->err = strdup(buf);
oneapi_release(resp->oh);
return;
}
LOG(resp->oh.verbose, "oneapi driver count: %d\n", resp->oh.num_drivers);
resp->oh.drivers = malloc(resp->oh.num_drivers * sizeof(zes_driver_handle_t));
resp->oh.num_devices = malloc(resp->oh.num_drivers * sizeof(uint32_t));
memset(&resp->oh.num_devices[0], 0, resp->oh.num_drivers * sizeof(uint32_t));
resp->oh.devices =
malloc(resp->oh.num_drivers * sizeof(zes_device_handle_t *));
ret = (*resp->oh.zesDriverGet)(&resp->oh.num_drivers, &resp->oh.drivers[0]);
if (ret != ZE_RESULT_SUCCESS) {
LOG(resp->oh.verbose, "zesDriverGet err: %x\n", ret);
snprintf(buf, buflen, "unable to get driver count: %x", ret);
resp->err = strdup(buf);
oneapi_release(resp->oh);
return;
}
for (d = 0; d < resp->oh.num_drivers; d++) {
LOG(resp->oh.verbose, "calling zesDeviceGet count %d: %p\n", d, resp->oh.drivers[d]);
ret = (*resp->oh.zesDeviceGet)(resp->oh.drivers[d],
&resp->oh.num_devices[d], NULL);
if (ret != ZE_RESULT_SUCCESS) {
LOG(resp->oh.verbose, "zesDeviceGet err: %x\n", ret);
snprintf(buf, buflen, "unable to get device count: %x", ret);
resp->err = strdup(buf);
oneapi_release(resp->oh);
return;
}
resp->oh.devices[d] =
malloc(resp->oh.num_devices[d] * sizeof(zes_device_handle_t));
ret = (*resp->oh.zesDeviceGet)(
resp->oh.drivers[d], &resp->oh.num_devices[d], resp->oh.devices[d]);
if (ret != ZE_RESULT_SUCCESS) {
LOG(resp->oh.verbose, "zesDeviceGet err: %x\n", ret);
snprintf(buf, buflen, "unable to get device count: %x", ret);
resp->err = strdup(buf);
oneapi_release(resp->oh);
return;
}
}
return;
}
void oneapi_check_vram(oneapi_handle_t h, int driver, int device,
mem_info_t *resp) {
ze_result_t ret;
resp->err = NULL;
uint64_t totalMem = 0;
uint64_t usedMem = 0;
const int buflen = 256;
char buf[buflen + 1];
int i, d, m;
if (h.handle == NULL) {
resp->err = strdup("Level-Zero handle not initialized");
return;
}
if (driver >= h.num_drivers || device >= h.num_devices[driver]) {
resp->err = strdup("driver or device index out of bounds");
return;
}
resp->total = 0;
resp->free = 0;
zes_device_ext_properties_t ext_props;
ext_props.stype = ZES_STRUCTURE_TYPE_DEVICE_EXT_PROPERTIES;
ext_props.pNext = NULL;
zes_device_properties_t props;
props.stype = ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES;
props.pNext = &ext_props;
ret = (*h.zesDeviceGetProperties)(h.devices[driver][device], &props);
if (ret != ZE_RESULT_SUCCESS) {
snprintf(buf, buflen, "unable to get device properties: %d", ret);
resp->err = strdup(buf);
return;
}
snprintf(&resp->gpu_name[0], GPU_NAME_LEN, "%s", props.modelName);
// TODO this needs to map to ONEAPI_DEVICE_SELECTOR syntax
// (this is probably wrong...)
// TODO - the driver isn't included - what if there are multiple drivers?
snprintf(&resp->gpu_id[0], GPU_ID_LEN, "%d", device);
if (h.verbose) {
// When in verbose mode, report more information about
// the card we discover.
LOG(h.verbose, "[%d:%d] oneAPI device name: %s\n", driver, device,
props.modelName);
LOG(h.verbose, "[%d:%d] oneAPI brand: %s\n", driver, device,
props.brandName);
LOG(h.verbose, "[%d:%d] oneAPI vendor: %s\n", driver, device,
props.vendorName);
LOG(h.verbose, "[%d:%d] oneAPI S/N: %s\n", driver, device,
props.serialNumber);
LOG(h.verbose, "[%d:%d] oneAPI board number: %s\n", driver, device,
props.boardNumber);
}
// TODO
// Compute Capability equivalent in resp->major, resp->minor, resp->patch
uint32_t memCount = 0;
ret = (*h.zesDeviceEnumMemoryModules)(h.devices[driver][device], &memCount,
NULL);
if (ret != ZE_RESULT_SUCCESS) {
snprintf(buf, buflen, "unable to enumerate Level-Zero memory modules: %x",
ret);
resp->err = strdup(buf);
return;
}
LOG(h.verbose, "discovered %d Level-Zero memory modules\n", memCount);
zes_mem_handle_t *mems = malloc(memCount * sizeof(zes_mem_handle_t));
(*h.zesDeviceEnumMemoryModules)(h.devices[driver][device], &memCount, mems);
for (m = 0; m < memCount; m++) {
zes_mem_state_t state;
state.stype = ZES_STRUCTURE_TYPE_MEM_STATE;
state.pNext = NULL;
ret = (*h.zesMemoryGetState)(mems[m], &state);
if (ret != ZE_RESULT_SUCCESS) {
snprintf(buf, buflen, "unable to get memory state: %x", ret);
resp->err = strdup(buf);
free(mems);
return;
}
resp->total += state.size;
resp->free += state.free;
}
free(mems);
}
void oneapi_release(oneapi_handle_t h) {
int d;
LOG(h.verbose, "releasing oneapi library\n");
for (d = 0; d < h.num_drivers; d++) {
if (h.devices != NULL && h.devices[d] != NULL) {
free(h.devices[d]);
}
}
if (h.devices != NULL) {
free(h.devices);
h.devices = NULL;
}
if (h.num_devices != NULL) {
free(h.num_devices);
h.num_devices = NULL;
}
if (h.drivers != NULL) {
free(h.drivers);
h.drivers = NULL;
}
h.num_drivers = 0;
UNLOAD_LIBRARY(h.handle);
h.handle = NULL;
}
int oneapi_get_device_count(oneapi_handle_t h, int driver) {
if (h.handle == NULL || h.num_devices == NULL) {
return 0;
}
if (driver >= h.num_drivers) {
return 0;
}
return (int)h.num_devices[driver];
}
#endif // __APPLE__

@@ -1,203 +0,0 @@
#ifndef __APPLE__
#ifndef __GPU_INFO_ONEAPI_H__
#define __GPU_INFO_ONEAPI_H__
#include "gpu_info.h"
#define ZE_MAX_DEVICE_NAME 256
#define ZE_MAX_DEVICE_UUID_SIZE 16
#define ZES_STRING_PROPERTY_SIZE 64
#define ZE_BIT(_i) (1 << _i)
// Just enough typedef's to dlopen/dlsym for memory information
typedef enum ze_result_t {
ZE_RESULT_SUCCESS = 0,
// Other values omitted for now...
} ze_result_t;
typedef uint8_t ze_bool_t;
typedef struct _zes_driver_handle_t *zes_driver_handle_t;
typedef struct _zes_device_handle_t *zes_device_handle_t;
typedef struct _zes_mem_handle_t *zes_mem_handle_t;
typedef enum _ze_structure_type_t {
ZE_STRUCTURE_TYPE_FORCE_UINT32 = 0x7fffffff
} ze_structure_type_t;
typedef enum _zes_structure_type_t {
ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES = 0x1,
ZES_STRUCTURE_TYPE_MEM_PROPERTIES = 0xb,
ZES_STRUCTURE_TYPE_MEM_STATE = 0x1e,
ZES_STRUCTURE_TYPE_DEVICE_EXT_PROPERTIES = 0x2d,
ZES_STRUCTURE_TYPE_FORCE_UINT32 = 0x7fffffff
} zes_structure_type_t;
typedef enum _zes_mem_type_t {
ZES_MEM_TYPE_FORCE_UINT32 = 0x7fffffff
} zes_mem_type_t;
typedef enum _zes_mem_loc_t {
ZES_MEM_LOC_SYSTEM = 0,
ZES_MEM_LOC_DEVICE = 1,
ZES_MEM_LOC_FORCE_UINT32 = 0x7fffffff
} zes_mem_loc_t;
typedef enum _zes_mem_health_t {
ZES_MEM_HEALTH_FORCE_UINT32 = 0x7fffffff
} zes_mem_health_t;
typedef struct _ze_device_uuid_t {
uint8_t id[ZE_MAX_DEVICE_UUID_SIZE];
} ze_device_uuid_t;
typedef struct _zes_uuid_t {
uint8_t id[ZE_MAX_DEVICE_UUID_SIZE];
} zes_uuid_t;
typedef enum _ze_device_type_t {
ZE_DEVICE_TYPE_GPU = 1,
ZE_DEVICE_TYPE_CPU = 2,
ZE_DEVICE_TYPE_FPGA = 3,
ZE_DEVICE_TYPE_MCA = 4,
ZE_DEVICE_TYPE_VPU = 5,
ZE_DEVICE_TYPE_FORCE_UINT32 = 0x7fffffff
} ze_device_type_t;
typedef enum _zes_device_type_t {
ZES_DEVICE_TYPE_GPU = 1,
ZES_DEVICE_TYPE_CPU = 2,
ZES_DEVICE_TYPE_FPGA = 3,
ZES_DEVICE_TYPE_MCA = 4,
ZES_DEVICE_TYPE_VPU = 5,
ZES_DEVICE_TYPE_FORCE_UINT32 = 0x7fffffff
} zes_device_type_t;
typedef uint32_t ze_device_property_flags_t;
typedef enum _ze_device_property_flag_t {
ZE_DEVICE_PROPERTY_FLAG_INTEGRATED = ZE_BIT(0),
ZE_DEVICE_PROPERTY_FLAG_SUBDEVICE = ZE_BIT(1),
ZE_DEVICE_PROPERTY_FLAG_ECC = ZE_BIT(2),
ZE_DEVICE_PROPERTY_FLAG_ONDEMANDPAGING = ZE_BIT(3),
ZE_DEVICE_PROPERTY_FLAG_FORCE_UINT32 = 0x7fffffff
} ze_device_property_flag_t;
typedef uint32_t zes_device_property_flags_t;
typedef enum _zes_device_property_flag_t {
ZES_DEVICE_PROPERTY_FLAG_INTEGRATED = ZE_BIT(0),
ZES_DEVICE_PROPERTY_FLAG_SUBDEVICE = ZE_BIT(1),
ZES_DEVICE_PROPERTY_FLAG_ECC = ZE_BIT(2),
ZES_DEVICE_PROPERTY_FLAG_ONDEMANDPAGING = ZE_BIT(3),
ZES_DEVICE_PROPERTY_FLAG_FORCE_UINT32 = 0x7fffffff
} zes_device_property_flag_t;
typedef struct _ze_device_properties_t {
ze_structure_type_t stype;
void *pNext;
ze_device_type_t type;
uint32_t vendorId;
uint32_t deviceId;
ze_device_property_flags_t flags;
uint32_t subdeviceId;
uint32_t coreClockRate;
uint64_t maxMemAllocSize;
uint32_t maxHardwareContexts;
uint32_t maxCommandQueuePriority;
uint32_t numThreadsPerEU;
uint32_t physicalEUSimdWidth;
uint32_t numEUsPerSubslice;
uint32_t numSubslicesPerSlice;
uint32_t numSlices;
uint64_t timerResolution;
uint32_t timestampValidBits;
uint32_t kernelTimestampValidBits;
ze_device_uuid_t uuid;
char name[ZE_MAX_DEVICE_NAME];
} ze_device_properties_t;
typedef struct _zes_device_properties_t {
zes_structure_type_t stype;
void *pNext;
ze_device_properties_t core;
uint32_t numSubdevices;
char serialNumber[ZES_STRING_PROPERTY_SIZE];
char boardNumber[ZES_STRING_PROPERTY_SIZE];
char brandName[ZES_STRING_PROPERTY_SIZE];
char modelName[ZES_STRING_PROPERTY_SIZE];
char vendorName[ZES_STRING_PROPERTY_SIZE];
char driverVersion[ZES_STRING_PROPERTY_SIZE];
} zes_device_properties_t;
typedef struct _zes_device_ext_properties_t {
zes_structure_type_t stype;
void *pNext;
zes_uuid_t uuid;
zes_device_type_t type;
zes_device_property_flags_t flags;
} zes_device_ext_properties_t;
typedef struct _zes_mem_properties_t {
zes_structure_type_t stype;
void *pNext;
zes_mem_type_t type;
ze_bool_t onSubdevice;
uint32_t subdeviceId;
zes_mem_loc_t location;
uint64_t physicalSize;
int32_t busWidth;
int32_t numChannels;
} zes_mem_properties_t;
typedef struct _zes_mem_state_t {
zes_structure_type_t stype;
const void *pNext;
zes_mem_health_t health;
uint64_t free;
uint64_t size;
} zes_mem_state_t;
typedef struct oneapi_handle {
void *handle;
uint16_t verbose;
uint32_t num_drivers;
zes_driver_handle_t *drivers;
uint32_t *num_devices;
zes_device_handle_t **devices;
// TODO Driver major, minor information
// int driver_major;
// int driver_minor;
ze_result_t (*zesInit)(int);
ze_result_t (*zesDriverGet)(uint32_t *pCount, zes_driver_handle_t *phDrivers);
ze_result_t (*zesDeviceGet)(zes_driver_handle_t hDriver, uint32_t *pCount,
zes_device_handle_t *phDevices);
ze_result_t (*zesDeviceGetProperties)(zes_device_handle_t hDevice,
zes_device_properties_t *pProperties);
ze_result_t (*zesDeviceEnumMemoryModules)(zes_device_handle_t hDevice,
uint32_t *pCount,
zes_mem_handle_t *phMemory);
ze_result_t (*zesMemoryGetProperties)(zes_mem_handle_t hMemory,
zes_mem_properties_t *pProperties);
ze_result_t (*zesMemoryGetState)(zes_mem_handle_t hMemory,
zes_mem_state_t *pState);
} oneapi_handle_t;
typedef struct oneapi_init_resp {
char *err; // If err is non-null handle is invalid
oneapi_handle_t oh;
} oneapi_init_resp_t;
typedef struct oneapi_version_resp {
ze_result_t status;
char *str; // Contains version or error string if status != 0
} oneapi_version_resp_t;
void oneapi_init(char *oneapi_lib_path, oneapi_init_resp_t *resp);
void oneapi_check_vram(oneapi_handle_t h, int driver, int device,
mem_info_t *resp);
void oneapi_release(oneapi_handle_t h);
int oneapi_get_device_count(oneapi_handle_t h, int driver);
#endif // __GPU_INFO_ONEAPI_H__
#endif // __APPLE__

@@ -1,21 +0,0 @@
//go:build linux || windows
package discover
import (
"log/slog"
"strings"
)
func oneapiGetVisibleDevicesEnv(gpuInfo []GpuInfo) (string, string) {
ids := []string{}
for _, info := range gpuInfo {
if info.Library != "oneapi" {
// TODO shouldn't happen if things are wired correctly...
slog.Debug("oneapiGetVisibleDevicesEnv skipping over non-sycl device", "library", info.Library)
continue
}
ids = append(ids, info.ID)
}
return "ONEAPI_DEVICE_SELECTOR", "level_zero:" + strings.Join(ids, ",")
}

@@ -1,60 +0,0 @@
package discover
import (
"runtime"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestBasicGetGPUInfo(t *testing.T) {
info := GetGPUInfo()
assert.NotEmpty(t, len(info))
assert.Contains(t, "cuda rocm cpu metal", info[0].Library)
if info[0].Library != "cpu" {
assert.Greater(t, info[0].TotalMemory, uint64(0))
assert.Greater(t, info[0].FreeMemory, uint64(0))
}
}
func TestCPUMemInfo(t *testing.T) {
info, err := GetCPUMem()
require.NoError(t, err)
switch runtime.GOOS {
case "darwin":
t.Skip("CPU memory not populated on darwin")
case "linux", "windows":
assert.Greater(t, info.TotalMemory, uint64(0))
assert.Greater(t, info.FreeMemory, uint64(0))
default:
return
}
}
func TestByLibrary(t *testing.T) {
type testCase struct {
input []GpuInfo
expect int
}
testCases := map[string]*testCase{
"empty": {input: []GpuInfo{}, expect: 0},
"cpu": {input: []GpuInfo{{Library: "cpu"}}, expect: 1},
"cpu + GPU": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda"}}, expect: 2},
"cpu + 2 GPU no variant": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda"}, {Library: "cuda"}}, expect: 2},
"cpu + 2 GPU same variant": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v11"}}, expect: 2},
"cpu + 2 GPU diff variant": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v12"}}, expect: 3},
}
for k, v := range testCases {
t.Run(k, func(t *testing.T) {
resp := (GpuInfoList)(v.input).ByLibrary()
if len(resp) != v.expect {
t.Fatalf("expected length %d, got %d => %+v", v.expect, len(resp), resp)
}
})
}
}
// TODO - add some logic to figure out card type through other means and actually verify we got back what we expected

488
discover/runner.go Normal file

@@ -0,0 +1,488 @@
package discover
// Runner based GPU discovery
import (
"context"
"io"
"log/slog"
"os"
"os/exec"
"path/filepath"
"runtime"
"sort"
"strconv"
"strings"
"sync"
"time"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/llm"
"github.com/ollama/ollama/logutil"
"github.com/ollama/ollama/ml"
)
var (
deviceMu sync.Mutex
devices []ml.DeviceInfo
libDirs map[string]struct{}
rocmDir string
exe string
bootstrapped bool
)
func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.DeviceInfo {
deviceMu.Lock()
defer deviceMu.Unlock()
startDiscovery := time.Now()
msg := "overall device VRAM discovery took"
defer func() {
slog.Debug(msg, "duration", time.Since(startDiscovery))
}()
if !bootstrapped {
msg = "GPU bootstrap discovery took"
libDirs = make(map[string]struct{})
var err error
exe, err = os.Executable()
if err != nil {
slog.Error("unable to lookup executable path", "error", err)
return nil
}
if eval, err := filepath.EvalSymlinks(exe); err == nil {
exe = eval
}
files, err := filepath.Glob(filepath.Join(LibOllamaPath, "*", "*ggml-*"))
if err != nil {
slog.Debug("unable to lookup runner library directories", "error", err)
}
for _, file := range files {
libDirs[filepath.Dir(file)] = struct{}{}
}
// Our current packaging model places ggml-hip in the main directory
// but keeps rocm in an isolated directory. We have to add it to
// the [LD_LIBRARY_]PATH so ggml-hip will load properly
rocmDir = filepath.Join(LibOllamaPath, "rocm")
if _, err := os.Stat(rocmDir); err != nil {
rocmDir = ""
}
if len(libDirs) == 0 {
libDirs[""] = struct{}{}
}
slog.Info("discovering available GPUs...")
requested := envconfig.LLMLibrary()
jetpack := cudaJetpack()
// For our initial discovery pass, we gather all the known GPUs through
// all the libraries that were detected. This pass may include GPUs that
// are enumerated, but not actually supported.
// We run this in serial to avoid potentially initializing a GPU multiple
// times concurrently leading to memory contention
// TODO refactor so we group the lib dirs and do serial per version, but parallel for different libs
for dir := range libDirs {
bootstrapTimeout := 30 * time.Second
var dirs []string
if dir != "" {
if requested != "" && filepath.Base(dir) != requested {
slog.Debug("skipping available library at user's request", "requested", requested, "libDir", dir)
continue
} else if jetpack != "" && filepath.Base(dir) != "cuda_"+jetpack {
continue
}
}
if dir == "" {
dirs = []string{LibOllamaPath}
} else {
dirs = []string{LibOllamaPath, dir}
}
// ROCm can take a long time on some systems, so give it more time before giving up
if dir != "" && strings.Contains(filepath.Base(dir), "rocm") {
bootstrapTimeout = 60 * time.Second
}
// Typically bootstrapping takes < 1s, but on some systems, with devices
// in low power/idle mode, initialization can take multiple seconds. We
// set a long timeout just for bootstrap discovery to reduce the chance
// of giving up too quickly
ctx1stPass, cancel := context.WithTimeout(ctx, bootstrapTimeout)
defer cancel()
// For this pass, we retain duplicates in case any are incompatible with some libraries
devices = append(devices, bootstrapDevices(ctx1stPass, dirs, nil)...)
}
// In the second pass, we more deeply initialize the GPUs to weed out devices that
// aren't supported by a given library. We run this phase in parallel to speed up discovery.
slog.Debug("filtering out unsupported or overlapping GPU library combinations", "count", len(devices))
ctx2ndPass, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
var wg sync.WaitGroup
needsDelete := make([]bool, len(devices))
supportedMu := sync.Mutex{}
supported := make(map[string]map[string]map[string]int) // [Library][libDir][ID] = pre-deletion devices index
for i := range devices {
libDir := devices[i].LibraryPath[len(devices[i].LibraryPath)-1]
if devices[i].Library == "Metal" {
continue
}
slog.Debug("verifying GPU is supported", "library", libDir, "description", devices[i].Description, "compute", devices[i].Compute(), "pci_id", devices[i].PCIID)
wg.Add(1)
go func(i int) {
defer wg.Done()
var envVar string
id := devices[i].ID
if devices[i].Library == "ROCm" {
if runtime.GOOS != "linux" {
envVar = "HIP_VISIBLE_DEVICES"
} else {
envVar = "ROCR_VISIBLE_DEVICES"
}
} else if devices[i].Library == "CUDA" {
envVar = "CUDA_VISIBLE_DEVICES"
} else if devices[i].Library == "Vulkan" {
id = devices[i].FilteredID
envVar = "GGML_VK_VISIBLE_DEVICES"
} else {
slog.Error("Unknown Library:" + devices[i].Library)
}
extraEnvs := map[string]string{
"GGML_CUDA_INIT": "1", // force deep initialization to trigger crash on unsupported GPUs
envVar: id, // Filter to just this one GPU
}
if len(bootstrapDevices(ctx2ndPass, devices[i].LibraryPath, extraEnvs)) == 0 {
needsDelete[i] = true
} else {
supportedMu.Lock()
if _, ok := supported[devices[i].Library]; !ok {
supported[devices[i].Library] = make(map[string]map[string]int)
}
if _, ok := supported[devices[i].Library][libDir]; !ok {
supported[devices[i].Library][libDir] = make(map[string]int)
}
supported[devices[i].Library][libDir][devices[i].ID] = i
supportedMu.Unlock()
}
}(i)
}
wg.Wait()
logutil.Trace("supported GPU library combinations", "supported", supported)
filterOutVulkanThatAreSupportedByOtherGPU(needsDelete)
// Mark for deletion any overlaps - favoring the library version that can cover all GPUs if possible
filterOverlapByLibrary(supported, needsDelete)
// TODO if we ever support multiple ROCm library versions this algorithm will need to be adjusted to keep the rocmID numeric value correct
rocmID := 0
for i := 0; i < len(needsDelete); i++ {
if needsDelete[i] {
logutil.Trace("removing unsupported or overlapping GPU combination", "libDir", devices[i].LibraryPath[len(devices[i].LibraryPath)-1], "description", devices[i].Description, "compute", devices[i].Compute(), "pci_id", devices[i].PCIID)
devices = append(devices[:i], devices[i+1:]...)
needsDelete = append(needsDelete[:i], needsDelete[i+1:]...)
i--
} else if devices[i].Library == "ROCm" {
if _, err := strconv.Atoi(devices[i].ID); err == nil {
// Replace the numeric ID with the post-filtered IDs
devices[i].FilteredID = devices[i].ID
devices[i].ID = strconv.Itoa(rocmID)
}
rocmID++
}
}
// Now filter out any overlap with different libraries (favor CUDA/HIP over others)
for i := 0; i < len(devices); i++ {
for j := i + 1; j < len(devices); j++ {
// For this pass, we only drop exact duplicates
switch devices[i].Compare(devices[j]) {
case ml.SameBackendDevice:
// Same library and device, skip it
devices = append(devices[:j], devices[j+1:]...)
j--
continue
case ml.DuplicateDevice:
// Different library, choose based on priority
var droppedDevice ml.DeviceInfo
if devices[i].Library == "CUDA" || devices[i].Library == "ROCm" {
droppedDevice = devices[j]
} else {
droppedDevice = devices[i]
devices[i] = devices[j]
}
devices = append(devices[:j], devices[j+1:]...)
j--
typeStr := "discrete"
if droppedDevice.Integrated {
typeStr = "iGPU"
}
slog.Debug("dropping duplicate device",
"id", droppedDevice.ID,
"library", droppedDevice.Library,
"compute", droppedDevice.Compute(),
"name", droppedDevice.Name,
"description", droppedDevice.Description,
"libdirs", strings.Join(droppedDevice.LibraryPath, ","),
"driver", droppedDevice.Driver(),
"pci_id", droppedDevice.PCIID,
"type", typeStr,
"total", format.HumanBytes2(droppedDevice.TotalMemory),
"available", format.HumanBytes2(droppedDevice.FreeMemory),
)
continue
}
}
}
// Reset the libDirs to what we actually wind up using for future refreshes
libDirs = make(map[string]struct{})
for _, dev := range devices {
dir := dev.LibraryPath[len(dev.LibraryPath)-1]
if dir != LibOllamaPath {
libDirs[dir] = struct{}{}
}
}
if len(libDirs) == 0 {
libDirs[""] = struct{}{}
}
bootstrapped = true
} else {
if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
// metal never updates free VRAM
return devices
}
slog.Debug("refreshing free memory")
updated := make([]bool, len(devices))
allDone := func() bool {
allDone := true
for _, done := range updated {
if !done {
allDone = false
break
}
}
return allDone
}
// First try to use existing runners to refresh VRAM since they're already
// active on GPU(s)
for _, runner := range runners {
if runner == nil {
continue
}
deviceIDs := runner.GetActiveDeviceIDs()
if len(deviceIDs) == 0 {
// Skip this runner since it doesn't have active GPU devices
continue
}
// Check to see if this runner is active on any devices that need a refresh
skip := true
devCheck:
for _, dev := range deviceIDs {
for i := range devices {
if dev == devices[i].DeviceID {
if !updated[i] {
skip = false
break devCheck
}
}
}
}
if skip {
continue
}
// Typical refresh on existing runner is ~500ms but allow longer if the system
// is under stress before giving up and using stale data.
ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
start := time.Now()
updatedDevices := runner.GetDeviceInfos(ctx)
slog.Debug("existing runner discovery took", "duration", time.Since(start))
for _, u := range updatedDevices {
for i := range devices {
if u.DeviceID == devices[i].DeviceID {
updated[i] = true
devices[i].FreeMemory = u.FreeMemory
break
}
}
}
// Short circuit if we've updated all the devices
if allDone() {
break
}
}
if !allDone() {
slog.Debug("unable to refresh all GPUs with existing runners, performing bootstrap discovery")
// Bootstrapping may take longer in some cases (AMD windows), but we
// would rather use stale free data to get the model running sooner
ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
for dir := range libDirs {
updatedDevices := bootstrapDevices(ctx, []string{LibOllamaPath, dir}, nil)
for _, u := range updatedDevices {
for i := range devices {
if u.DeviceID == devices[i].DeviceID {
updated[i] = true
devices[i].FreeMemory = u.FreeMemory
break
}
}
// TODO - consider evaluating if new devices have appeared (e.g. hotplug)
}
if allDone() {
break
}
}
if !allDone() {
slog.Warn("unable to refresh free memory, using old values")
}
}
}
return devices
}
func filterOutVulkanThatAreSupportedByOtherGPU(needsDelete []bool) {
// Filter out Vulkan devices that share a PCI ID with a non-Vulkan device that is not marked for deletion
for i := range devices {
if devices[i].Library != "Vulkan" || needsDelete[i] {
continue
}
if devices[i].PCIID == "" {
continue
}
for j := range devices {
if i == j {
continue
}
if devices[j].PCIID == "" {
continue
}
if devices[j].PCIID == devices[i].PCIID && devices[j].Library != "Vulkan" && !needsDelete[j] {
needsDelete[i] = true
slog.Debug("dropping Vulkan duplicate by PCI ID",
"vulkan_id", devices[i].ID,
"vulkan_libdir", devices[i].LibraryPath[len(devices[i].LibraryPath)-1],
"pci_id", devices[i].PCIID,
"kept_library", devices[j].Library,
"kept_id", devices[j].ID,
)
break
}
}
}
}
func filterOverlapByLibrary(supported map[string]map[string]map[string]int, needsDelete []bool) {
// For multi-GPU systems, use the newest version that supports all the GPUs
for _, byLibDirs := range supported {
libDirs := make([]string, 0, len(byLibDirs))
for libDir := range byLibDirs {
libDirs = append(libDirs, libDir)
}
sort.Sort(sort.Reverse(sort.StringSlice(libDirs)))
anyMissing := false
var newest string
for _, newest = range libDirs {
for _, libDir := range libDirs {
if libDir == newest {
continue
}
if len(byLibDirs[newest]) != len(byLibDirs[libDir]) {
anyMissing = true
break
}
for dev := range byLibDirs[newest] {
if _, found := byLibDirs[libDir][dev]; !found {
anyMissing = true
break
}
}
}
if !anyMissing {
break
}
}
// Now we can mark overlaps for deletion
for _, libDir := range libDirs {
if libDir == newest {
continue
}
for dev, i := range byLibDirs[libDir] {
if _, found := byLibDirs[newest][dev]; found {
needsDelete[i] = true
}
}
}
}
}
type bootstrapRunner struct {
port int
cmd *exec.Cmd
}
func (r *bootstrapRunner) GetPort() int {
return r.port
}
func (r *bootstrapRunner) HasExited() bool {
if r.cmd != nil && r.cmd.ProcessState != nil {
return true
}
return false
}
func bootstrapDevices(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) []ml.DeviceInfo {
var out io.Writer
if envconfig.LogLevel() == logutil.LevelTrace {
out = os.Stderr
}
start := time.Now()
defer func() {
slog.Debug("bootstrap discovery took", "duration", time.Since(start), "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs)
}()
logutil.Trace("starting runner for device discovery", "libDirs", ollamaLibDirs, "extraEnvs", extraEnvs)
cmd, port, err := llm.StartRunner(
true, // ollama engine
"", // no model
ollamaLibDirs,
out,
extraEnvs,
)
if err != nil {
slog.Debug("failed to start runner to discover GPUs", "error", err)
return nil
}
go func() {
cmd.Wait() // exit status ignored
}()
defer cmd.Process.Kill()
devices, err := ml.GetDevicesFromRunner(ctx, &bootstrapRunner{port: port, cmd: cmd})
if err != nil {
if cmd.ProcessState != nil && cmd.ProcessState.ExitCode() >= 0 {
// Expected during bootstrapping while we filter out unsupported AMD GPUs
logutil.Trace("runner exited", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "code", cmd.ProcessState.ExitCode())
} else {
slog.Info("failure during GPU discovery", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", err)
}
}
logutil.Trace("runner enumerated devices", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "devices", devices)
return devices
}
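
The removal pass in `GPUDevices` above drops every device flagged in `needsDelete` and then gives the surviving ROCm devices consecutive numeric IDs, keeping the original ID in `FilteredID`. The following is a minimal sketch of that idea, not the code from this change: it filters into a new slice instead of deleting in place, and the `device` type and `compact` function are hypothetical stand-ins for `ml.DeviceInfo` and the inline loop.

```go
package main

import (
	"fmt"
	"strconv"
)

// device is a hypothetical stand-in for ml.DeviceInfo with only the
// fields this sketch needs.
type device struct {
	Library    string
	ID         string
	FilteredID string
}

// compact returns the devices whose needsDelete flag is false. Surviving
// ROCm devices with a numeric ID are renumbered 0, 1, 2, ... and their
// original ID is preserved in FilteredID.
func compact(devices []device, needsDelete []bool) []device {
	kept := make([]device, 0, len(devices))
	rocmID := 0
	for i, dev := range devices {
		if needsDelete[i] {
			continue
		}
		if dev.Library == "ROCm" {
			if _, err := strconv.Atoi(dev.ID); err == nil {
				dev.FilteredID = dev.ID
				dev.ID = strconv.Itoa(rocmID)
			}
			rocmID++
		}
		kept = append(kept, dev)
	}
	return kept
}

func main() {
	devices := []device{
		{Library: "ROCm", ID: "0"},
		{Library: "ROCm", ID: "1"},
		{Library: "CUDA", ID: "GPU-aaaa"},
		{Library: "ROCm", ID: "2"},
	}
	// Pretend the first ROCm device failed the second discovery pass.
	for _, d := range compact(devices, []bool{true, false, false, false}) {
		fmt.Printf("%s id=%s filtered=%s\n", d.Library, d.ID, d.FilteredID)
	}
}
```

Building a new slice avoids the `i--` index fix-up that in-place deletion requires, at the cost of one extra allocation.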

108
discover/runner_test.go Normal file

@@ -0,0 +1,108 @@
package discover
import (
"testing"
"github.com/ollama/ollama/app/lifecycle"
)
func init() {
lifecycle.InitLogging()
}
func TestFilterOverlapByLibrary(t *testing.T) {
type testcase struct {
name string
inp map[string]map[string]map[string]int
exp []bool
}
for _, tc := range []testcase{
{
name: "empty",
inp: map[string]map[string]map[string]int{},
exp: []bool{}, // needs deletion
},
{
name: "single no overlap",
inp: map[string]map[string]map[string]int{
"CUDA": {
"cuda_v12": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 0,
},
},
},
exp: []bool{false},
},
{
name: "100% overlap pick 2nd",
inp: map[string]map[string]map[string]int{
"CUDA": {
"cuda_v12": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 0,
"GPU-cd6c3216-03d2-a8eb-8235-2ffbf571712e": 1,
},
"cuda_v13": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 2,
"GPU-cd6c3216-03d2-a8eb-8235-2ffbf571712e": 3,
},
},
},
exp: []bool{true, true, false, false},
},
{
name: "100% overlap pick 1st",
inp: map[string]map[string]map[string]int{
"CUDA": {
"cuda_v13": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 0,
"GPU-cd6c3216-03d2-a8eb-8235-2ffbf571712e": 1,
},
"cuda_v12": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 2,
"GPU-cd6c3216-03d2-a8eb-8235-2ffbf571712e": 3,
},
},
},
exp: []bool{false, false, true, true},
},
{
name: "partial overlap pick older",
inp: map[string]map[string]map[string]int{
"CUDA": {
"cuda_v13": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 0,
},
"cuda_v12": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 1,
"GPU-cd6c3216-03d2-a8eb-8235-2ffbf571712e": 2,
},
},
},
exp: []bool{true, false, false},
},
{
name: "no overlap",
inp: map[string]map[string]map[string]int{
"CUDA": {
"cuda_v13": {
"GPU-d7b00605-c0c8-152d-529d-e03726d5dc52": 0,
},
"cuda_v12": {
"GPU-cd6c3216-03d2-a8eb-8235-2ffbf571712e": 1,
},
},
},
exp: []bool{false, false},
},
} {
t.Run(tc.name, func(t *testing.T) {
needsDelete := make([]bool, len(tc.exp))
filterOverlapByLibrary(tc.inp, needsDelete)
for i, exp := range tc.exp {
if needsDelete[i] != exp {
t.Fatalf("expected: %v\ngot: %v", tc.exp, needsDelete)
}
}
})
}
}
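
The test cases above suggest a simple reading of the selection rule: prefer the newest library directory that can see every GPU, otherwise fall back to an older one. The sketch below restates that rule in a simplified form under stated assumptions; `pickLibDir` is a hypothetical helper, not a function in this change, and it only chooses the directory rather than marking entries for deletion. Like `filterOverlapByLibrary`, it orders directories with a reverse lexical sort, so "newest" means lexically greatest.

```go
package main

import (
	"fmt"
	"sort"
)

// pickLibDir returns the lexically greatest directory whose device set
// covers every device ID seen in any directory. If no directory covers
// them all, it falls back to the lexically smallest directory.
func pickLibDir(byLibDir map[string]map[string]int) string {
	if len(byLibDir) == 0 {
		return ""
	}
	dirs := make([]string, 0, len(byLibDir))
	all := map[string]struct{}{}
	for dir, devs := range byLibDir {
		dirs = append(dirs, dir)
		for id := range devs {
			all[id] = struct{}{}
		}
	}
	sort.Sort(sort.Reverse(sort.StringSlice(dirs)))
	for _, dir := range dirs {
		// Every per-directory set is a subset of all, so equal sizes
		// mean this directory sees every device.
		if len(byLibDir[dir]) == len(all) {
			return dir
		}
	}
	return dirs[len(dirs)-1]
}

func main() {
	full := map[string]map[string]int{
		"cuda_v12": {"GPU-a": 0, "GPU-b": 1},
		"cuda_v13": {"GPU-a": 2, "GPU-b": 3},
	}
	partial := map[string]map[string]int{
		"cuda_v13": {"GPU-a": 0},
		"cuda_v12": {"GPU-a": 1, "GPU-b": 2},
	}
	fmt.Println(pickLibDir(full))
	fmt.Println(pickLibDir(partial))
}
```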

@@ -1,10 +1,12 @@
package discover
import (
"fmt"
"log/slog"
"path/filepath"
"strings"
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/ml"
)
type memInfo struct {
@@ -13,52 +15,6 @@ type memInfo struct {
FreeSwap uint64 `json:"free_swap,omitempty"` // TODO split this out for system only
}
// Beginning of an `ollama info` command
type GpuInfo struct { // TODO better name maybe "InferenceProcessor"?
memInfo
Library string `json:"library,omitempty"`
// Optional variant to select (e.g. versions, cpu feature flags)
Variant string `json:"variant"`
// MinimumMemory represents the minimum memory required to use the GPU
MinimumMemory uint64 `json:"-"`
// Any extra PATH/LD_LIBRARY_PATH dependencies required for the Library to operate properly
DependencyPath []string `json:"lib_path,omitempty"`
// Extra environment variables specific to the GPU as list of [key,value]
EnvWorkarounds [][2]string `json:"envs,omitempty"`
// Set to true if we can NOT reliably discover FreeMemory. A value of true indicates
// the FreeMemory is best effort, and may over or under report actual memory usage
// False indicates FreeMemory can generally be trusted on this GPU
UnreliableFreeMemory bool
// GPU information
ID string `json:"gpu_id"` // string to use for selection of this specific GPU
Name string `json:"name"` // user friendly name if available
Compute string `json:"compute"` // Compute Capability or gfx
// Driver Information - TODO no need to put this on each GPU
DriverMajor int `json:"driver_major,omitempty"`
DriverMinor int `json:"driver_minor,omitempty"`
// TODO other performance capability info to help in scheduling decisions
}
func (gpu GpuInfo) RunnerName() string {
if gpu.Variant != "" {
return gpu.Library + "_" + gpu.Variant
}
return gpu.Library
}
type CPUInfo struct {
GpuInfo
CPUs []CPU
}
// CPU type represents a CPU Package occupying a socket
type CPU struct {
ID string `cpuinfo:"processor"`
@@ -69,115 +25,47 @@ type CPU struct {
ThreadCount int
}
type CudaGPUInfo struct {
GpuInfo
OSOverhead uint64 // Memory overhead between the driver library and management library
index int //nolint:unused,nolintlint
computeMajor int //nolint:unused,nolintlint
computeMinor int //nolint:unused,nolintlint
}
type CudaGPUInfoList []CudaGPUInfo
type RocmGPUInfo struct {
GpuInfo
usedFilepath string //nolint:unused,nolintlint
index int //nolint:unused,nolintlint
}
type RocmGPUInfoList []RocmGPUInfo
type OneapiGPUInfo struct {
GpuInfo
driverIndex int //nolint:unused,nolintlint
gpuIndex int //nolint:unused,nolintlint
}
type OneapiGPUInfoList []OneapiGPUInfo
type GpuInfoList []GpuInfo
type UnsupportedGPUInfo struct {
GpuInfo
Reason string `json:"reason"`
}
// Split up the set of gpu info's by Library and variant
func (l GpuInfoList) ByLibrary() []GpuInfoList {
resp := []GpuInfoList{}
libs := []string{}
for _, info := range l {
found := false
requested := info.Library
if info.Variant != "" {
requested += "_" + info.Variant
}
for i, lib := range libs {
if lib == requested {
resp[i] = append(resp[i], info)
found = true
break
func LogDetails(devices []ml.DeviceInfo) {
for _, dev := range devices {
var libs []string
for _, dir := range dev.LibraryPath {
if strings.Contains(dir, filepath.Join("lib", "ollama")) {
libs = append(libs, filepath.Base(dir))
}
}
if !found {
libs = append(libs, requested)
resp = append(resp, []GpuInfo{info})
typeStr := "discrete"
if dev.Integrated {
typeStr = "iGPU"
}
}
return resp
}
// Report the GPU information to the log at Info level
func (l GpuInfoList) LogDetails() {
for _, g := range l {
slog.Info("inference compute",
"id", g.ID,
"library", g.Library,
"variant", g.Variant,
"compute", g.Compute,
"driver", fmt.Sprintf("%d.%d", g.DriverMajor, g.DriverMinor),
"name", g.Name,
"total", format.HumanBytes2(g.TotalMemory),
"available", format.HumanBytes2(g.FreeMemory),
"id", dev.ID,
"library", dev.Library,
"compute", dev.Compute(),
"name", dev.Name,
"description", dev.Description,
"libdirs", strings.Join(libs, ","),
"driver", dev.Driver(),
"pci_id", dev.PCIID,
"type", typeStr,
"total", format.HumanBytes2(dev.TotalMemory),
"available", format.HumanBytes2(dev.FreeMemory),
)
}
// CPU inference
if len(devices) == 0 {
dev, _ := GetCPUMem()
slog.Info("inference compute",
"id", "cpu",
"library", "cpu",
"compute", "",
"name", "cpu",
"description", "cpu",
"libdirs", "ollama",
"driver", "",
"pci_id", "",
"type", "",
"total", format.HumanBytes2(dev.TotalMemory),
"available", format.HumanBytes2(dev.FreeMemory),
)
}
}
// Sort by Free Space
type ByFreeMemory []GpuInfo
func (a ByFreeMemory) Len() int { return len(a) }
func (a ByFreeMemory) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a ByFreeMemory) Less(i, j int) bool { return a[i].FreeMemory < a[j].FreeMemory }
type SystemInfo struct {
System CPUInfo `json:"system"`
GPUs []GpuInfo `json:"gpus"`
UnsupportedGPUs []UnsupportedGPUInfo `json:"unsupported_gpus"`
DiscoveryErrors []string `json:"discovery_errors"`
}
// Return the optimal number of threads to use for inference
func (si SystemInfo) GetOptimalThreadCount() int {
if len(si.System.CPUs) == 0 {
return 0
}
coreCount := 0
for _, c := range si.System.CPUs {
coreCount += c.CoreCount - c.EfficiencyCoreCount
}
return coreCount
}
// For each GPU, check if it does NOT support flash attention
func (l GpuInfoList) FlashAttentionSupported() bool {
for _, gpu := range l {
supportsFA := gpu.Library == "metal" ||
(gpu.Library == "cuda" && gpu.DriverMajor >= 7) ||
gpu.Library == "rocm"
if !supportsFA {
return false
}
}
return true
}
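
The two helpers above reduce to simple arithmetic and a per-device predicate. The following is a minimal, self-contained sketch of that logic; the `CPU` and `GPU` types and the lowercase function names are simplified stand-ins invented for illustration, not the project's actual types.

```go
package main

import "fmt"

// CPU is a hypothetical, simplified stand-in for one CPU package.
type CPU struct {
	CoreCount           int
	EfficiencyCoreCount int
}

// GPU is a hypothetical, simplified stand-in for one discovered device.
type GPU struct {
	Library     string
	DriverMajor int
}

// optimalThreads counts performance cores only: total cores minus
// efficiency cores, summed over every CPU package.
func optimalThreads(cpus []CPU) int {
	n := 0
	for _, c := range cpus {
		n += c.CoreCount - c.EfficiencyCoreCount
	}
	return n
}

// flashAttentionSupported mirrors the predicate shown above: every device
// must pass the check, and an empty list reports true.
func flashAttentionSupported(gpus []GPU) bool {
	for _, g := range gpus {
		ok := g.Library == "metal" ||
			(g.Library == "cuda" && g.DriverMajor >= 7) ||
			g.Library == "rocm"
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	cpus := []CPU{{CoreCount: 12, EfficiencyCoreCount: 4}, {CoreCount: 8}}
	fmt.Println(optimalThreads(cpus)) // (12-4) + 8 = 16

	gpus := []GPU{{Library: "cuda", DriverMajor: 12}, {Library: "rocm"}}
	fmt.Println(flashAttentionSupported(gpus))
}
```

Note that a single unsupported device makes the whole list report `false`, which is the conservative choice for mixed-GPU systems.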


@@ -1593,7 +1593,7 @@ Then there is a series of downloading responses. Until any of the download is co
```json
{
"status": "downloading digestname",
"status": "pulling digestname",
"digest": "digestname",
"total": 2142590208,
"completed": 241970
@@ -1708,6 +1708,7 @@ Advanced parameters:
- `truncate`: truncates the end of each input to fit within context length. Returns error if `false` and context length is exceeded. Defaults to `true`
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
- `dimensions`: number of dimensions for the embedding
### Examples

40
docs/cloud.md Normal file

@@ -0,0 +1,40 @@
# Cloud
> Ollama's cloud is currently in preview. For full documentation, see [Ollama's documentation](https://docs.ollama.com/cloud).
## Cloud Models
[Cloud models](https://ollama.com/cloud) are a new kind of model in Ollama that can run without a powerful GPU. Instead, cloud models are automatically offloaded to Ollama's cloud while offering the same capabilities as local models, making it possible to keep using your local tools while running larger models that wouldn't fit on a personal computer.
Ollama currently supports the following cloud models, with more coming soon:
- `gpt-oss:20b-cloud`
- `gpt-oss:120b-cloud`
- `deepseek-v3.1:671b-cloud`
- `qwen3-coder:480b-cloud`
### Get started
To run a cloud model, open the terminal and run:
```
ollama run gpt-oss:120b-cloud
```
To run cloud models with integrations that work with Ollama, first download the cloud model:
```
ollama pull qwen3-coder:480b-cloud
```
Then sign in to Ollama:
```
ollama signin
```
Finally, access the model using the model name `qwen3-coder:480b-cloud` via Ollama's local API or tooling.
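
As an illustration of that last step, here is a minimal Go sketch that builds a request for the local API. It assumes the default local address `http://localhost:11434` and the `/api/generate` endpoint; the `generateRequest` type and `newGenerateRequest` helper are hypothetical names used only for this example, and the request is built but not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// generateRequest holds the fields this sketch sends to /api/generate.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

// newGenerateRequest (hypothetical helper) builds a non-streaming POST
// request against a local Ollama server at baseURL.
func newGenerateRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(generateRequest{Model: model, Prompt: prompt})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/generate", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newGenerateRequest("http://localhost:11434", "qwen3-coder:480b-cloud", "Why is the sky blue?")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// Sending requires a running, signed-in local server:
	//   resp, err := http.DefaultClient.Do(req)
}
```

The only cloud-specific part is the model name; the request shape is the same one used for local models.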
## Cloud API access
Cloud models can also be accessed directly on ollama.com's API. For more information, see the [docs](https://docs.ollama.com/cloud).


@@ -11,6 +11,10 @@ Then build and run Ollama from the root directory of the repository:
go run . serve
```
> [!NOTE]
> Ollama includes native code compiled with CGO. From time to time these data structures can change and CGO can get out of sync resulting in unexpected crashes. You can force a full build of the native code by running `go clean -cache` first.
## macOS (Apple Silicon)
macOS Apple Silicon supports Metal which is built-in to the Ollama binary. No additional steps are required.


@@ -20,9 +20,9 @@ Please refer to the [GPU docs](./gpu.md).
## How can I specify the context window size?
By default, Ollama uses a context window size of 4096 tokens.
By default, Ollama uses a context window size of 4096 tokens for most models. The `gpt-oss` model has a default context window size of 8192 tokens.
This can be overridden with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:
This can be overridden in Settings in the Windows and macOS App, or with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:
```shell
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
@@ -46,6 +46,8 @@ curl http://localhost:11434/api/generate -d '{
}'
```
Setting the context length higher may prevent the model from fitting onto the GPU, which will make the model run more slowly.
## How can I tell if my model was loaded onto the GPU?
Use the `ollama ps` command to see what models are currently loaded into memory.
@@ -57,8 +59,8 @@ ollama ps
> **Output**:
>
> ```
> NAME ID SIZE PROCESSOR UNTIL
> llama3:70b bcfb190ca3a7 42 GB 100% GPU 4 minutes from now
> NAME ID SIZE PROCESSOR CONTEXT UNTIL
> gpt-oss:20b 05afbac4bad6 16 GB 100% GPU 8192 4 minutes from now
> ```
The `Processor` column will show which memory the model was loaded into:
@@ -148,9 +150,11 @@ docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
```
## Does Ollama send my prompts and answers back to ollama.com?
## Does Ollama send my prompts and responses back to ollama.com?
No. Ollama runs locally, and conversation data does not leave your machine.
If you're running a model locally, your prompts and responses will always stay on your machine. Ollama Turbo in the App allows you to run your queries on Ollama's servers if you don't have a powerful enough GPU. Web search lets a model query the web, giving you more accurate and up-to-date information. Both Turbo and web search require sending your prompts and responses to Ollama.com. This data is neither logged nor stored.
If you don't want to see the Turbo and web search options in the app, you can disable them in Settings by turning on Airplane mode. In Airplane mode, all models will run locally, and your prompts and responses will stay on your machine.
## How can I expose Ollama on my network?
@@ -345,4 +349,4 @@ Ollama for Windows and macOS register as a login item during installation. You
- Open `Settings` -> `Users & Groups` -> `Login Items` and find the `Ollama` entry, then click the `-` (minus) to remove
**macOS Ventura (v13) and later**
- Open `Settings` and search for "Login Items", find the `Ollama` entry under "Allow in the Background", then click the slider to disable.
- Open `Settings` and search for "Login Items", find the `Ollama` entry under "Allow in the Background", then click the slider to disable.


@@ -9,15 +9,20 @@ Check your compute compatibility to see if your card is supported:
| ------------------ | ------------------- | ----------------------------------------------------------------------------------------------------------- |
| 12.0 | GeForce RTX 50xx | `RTX 5060` `RTX 5060 Ti` `RTX 5070` `RTX 5070 Ti` `RTX 5080` `RTX 5090` |
| | NVIDIA Professional | `RTX PRO 4000 Blackwell` `RTX PRO 4500 Blackwell` `RTX PRO 5000 Blackwell` `RTX PRO 6000 Blackwell` |
| 9.0 | NVIDIA | `H200` `H100` |
| 11.0 | Jetson | `T4000` `T5000` (Requires driver 580 or newer) |
| 10.3 | NVIDIA Professional | `B300` `GB300` (Requires driver 580 or newer) |
| 10.0 | NVIDIA Professional | `B200` `GB200` (Requires driver 580 or newer) |
| 9.0 | NVIDIA | `H200` `H100` `GH200` |
| 8.9 | GeForce RTX 40xx | `RTX 4090` `RTX 4080 SUPER` `RTX 4080` `RTX 4070 Ti SUPER` `RTX 4070 Ti` `RTX 4070 SUPER` `RTX 4070` `RTX 4060 Ti` `RTX 4060` |
| | NVIDIA Professional | `L4` `L40` `RTX 6000` |
| 8.7 | Jetson | `Orin Nano` `Orin NX` `AGX Orin` |
| 8.6 | GeForce RTX 30xx | `RTX 3090 Ti` `RTX 3090` `RTX 3080 Ti` `RTX 3080` `RTX 3070 Ti` `RTX 3070` `RTX 3060 Ti` `RTX 3060` `RTX 3050 Ti` `RTX 3050` |
| | NVIDIA Professional | `A40` `RTX A6000` `RTX A5000` `RTX A4000` `RTX A3000` `RTX A2000` `A10` `A16` `A2` |
| 8.0 | NVIDIA | `A100` `A30` |
| 7.5 | GeForce GTX/RTX | `GTX 1650 Ti` `TITAN RTX` `RTX 2080 Ti` `RTX 2080` `RTX 2070` `RTX 2060` |
| | NVIDIA Professional | `T4` `RTX 5000` `RTX 4000` `RTX 3000` `T2000` `T1200` `T1000` `T600` `T500` |
| | Quadro | `RTX 8000` `RTX 6000` `RTX 5000` `RTX 4000` |
| 7.2 | Jetson | `Xavier NX` `AGX Xavier` (Jetpack 5) |
| 7.0 | NVIDIA | `TITAN V` `V100` `Quadro GV100` |
| 6.1 | NVIDIA TITAN | `TITAN Xp` `TITAN X` |
| | GeForce GTX | `GTX 1080 Ti` `GTX 1080` `GTX 1070 Ti` `GTX 1070` `GTX 1060` `GTX 1050 Ti` `GTX 1050` |
@@ -51,20 +56,23 @@ sudo modprobe nvidia_uvm`
Ollama supports the following AMD GPUs:
### Linux Support
| Family | Cards and accelerators |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` `Vega 64` `Vega 56` |
| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` `V420` `V340` `V320` `Vega II Duo` `Vega II` `VII` `SSG` |
| AMD Instinct | `MI300X` `MI300A` `MI300` `MI250X` `MI250` `MI210` `MI200` `MI100` `MI60` `MI50` |
| Family | Cards and accelerators |
| -------------- | -------------------------------------------------------------------------------------------------------------------- |
| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` |
| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` `V420` `V340` `V320` |
| AMD Instinct | `MI300X` `MI300A` `MI300` `MI250X` `MI250` `MI210` `MI200` `MI100` |
### Windows Support
With ROCm v6.1, the following GPUs are supported on Windows.
With ROCm v6.2, the following GPUs are supported on Windows.
| Family | Cards and accelerators |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` |
| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` |
### Known Workarounds
- The RX Vega 56 requires `HSA_ENABLE_SDMA=0` to disable SDMA
### Overrides on Linux
Ollama leverages the AMD ROCm library, which does not support all AMD GPUs. In
@@ -85,8 +93,6 @@ At this time, the known supported GPU types on linux are the following LLVM Targ
This table shows some example GPUs that map to these LLVM targets:
| **LLVM Target** | **An Example GPU** |
|-----------------|---------------------|
| gfx900 | Radeon RX Vega 56 |
| gfx906 | Radeon Instinct MI50 |
| gfx908 | Radeon Instinct MI100 |
| gfx90a | Radeon Instinct MI210 |
| gfx940 | Radeon Instinct MI300 |


@@ -11,12 +11,13 @@ curl -fsSL https://ollama.com/install.sh | sh
## Manual install
> [!NOTE]
> If you are upgrading from a prior version, you should remove the old libraries with `sudo rm -rf /usr/lib/ollama` first.
> If you are upgrading from a prior version, you **MUST** remove the old libraries with `sudo rm -rf /usr/lib/ollama` first.
Download and extract the package:
```shell
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
sudo rm -rf /usr/lib/ollama
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
```
@@ -34,7 +35,11 @@ ollama -v
### AMD GPU install
If you have an AMD GPU, also download and extract the additional ROCm package:
If you have an AMD GPU, **also** download and extract the additional ROCm package:
> [!IMPORTANT]
> The ROCm tgz contains only AMD dependent libraries. You must extract **both** `ollama-linux-amd64.tgz` and `ollama-linux-amd64-rocm.tgz` into the same location.
```shell
curl -L https://ollama.com/download/ollama-linux-amd64-rocm.tgz -o ollama-linux-amd64-rocm.tgz


@@ -2,7 +2,7 @@
## System Requirements
* MacOS Monterey (v12) or newer
* MacOS Sonoma (v14) or newer
* Apple M series (CPU and GPU support) or x86 (CPU only)


@@ -38,26 +38,14 @@ Join the [Discord](https://discord.gg/ollama) for help interpreting the logs.
## LLM libraries
Ollama includes multiple LLM libraries compiled for different GPUs and CPU vector features. Ollama tries to pick the best one based on the capabilities of your system. If this autodetection has problems, or you run into other problems (e.g. crashes in your GPU), you can work around this by forcing a specific LLM library. `cpu_avx2` will perform the best, followed by `cpu_avx`, and the slowest but most compatible is `cpu`. Rosetta emulation under macOS will work with the `cpu` library.
In the server log, you will see a message that looks something like this (varies from release to release):
```
Dynamic LLM libraries [rocm_v6 cpu cpu_avx cpu_avx2 cuda_v12 rocm_v5]
```
Ollama includes multiple LLM libraries compiled for different GPU libraries and versions. Ollama tries to pick the best one based on the capabilities of your system. If this autodetection has problems, or you run into other problems (e.g. crashes in your GPU), you can work around this by forcing a specific LLM library.
**Experimental LLM Library Override**
You can set OLLAMA_LLM_LIBRARY to any of the available LLM libraries to bypass autodetection, so for example, if you have a CUDA card, but want to force the CPU LLM library with AVX2 vector support, use:
You can set OLLAMA_LLM_LIBRARY to any of the available LLM libraries to limit autodetection. For example, if you have both CUDA and AMD GPUs but want to force CUDA v13 only, use:
```shell
OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve
```
You can see what features your CPU has with the following.
```shell
cat /proc/cpuinfo| grep flags | head -1
OLLAMA_LLM_LIBRARY="cuda_v13" ollama serve
```
## Installing older or pre-release versions on Linux
@@ -92,6 +80,9 @@ If none of those resolve the problem, gather additional information and file an
- Set `CUDA_ERROR_LEVEL=50` and try again to get more diagnostic logs
- Check dmesg for any errors `sudo dmesg | grep -i nvrm` and `sudo dmesg | grep -i nvidia`
You may get more details for initialization failures by enabling debug prints in the uvm driver. You should only use this temporarily while troubleshooting.
- `sudo rmmod nvidia_uvm` then `sudo modprobe nvidia_uvm uvm_debug_prints=1`
## AMD GPU Discovery


@@ -68,9 +68,9 @@ If you'd like to install or integrate Ollama as a service, a standalone
`ollama-windows-amd64.zip` zip file is available containing only the Ollama CLI
and GPU library dependencies for Nvidia. If you have an AMD GPU, also download
and extract the additional ROCm package `ollama-windows-amd64-rocm.zip` into the
same directory. This allows for embedding Ollama in existing applications, or
running it as a system service via `ollama serve` with tools such as
[NSSM](https://nssm.cc/).
same directory. Both zip files are necessary for a complete AMD installation.
This allows for embedding Ollama in existing applications, or running it as a
system service via `ollama serve` with tools such as [NSSM](https://nssm.cc/).
> [!NOTE]
> If you are upgrading from a prior version, you should remove the old directories first.


@@ -24,6 +24,9 @@ func Host() *url.URL {
switch {
case !ok:
scheme, hostport = "http", s
if s == "ollama.com" {
scheme, hostport = "https", "ollama.com:443"
}
case scheme == "http":
defaultPort = "80"
case scheme == "https":
@@ -134,8 +137,19 @@ func LoadTimeout() (loadTimeout time.Duration) {
return loadTimeout
}
func Bool(k string) func() bool {
return func() bool {
func Remotes() []string {
var r []string
raw := strings.TrimSpace(Var("OLLAMA_REMOTES"))
if raw == "" {
r = []string{"ollama.com"}
} else {
r = strings.Split(raw, ",")
}
return r
}
func BoolWithDefault(k string) func(defaultValue bool) bool {
return func(defaultValue bool) bool {
if s := Var(k); s != "" {
b, err := strconv.ParseBool(s)
if err != nil {
@@ -145,7 +159,14 @@ func Bool(k string) func() bool {
return b
}
return false
return defaultValue
}
}
func Bool(k string) func() bool {
withDefault := BoolWithDefault(k)
return func() bool {
return withDefault(false)
}
}
@@ -166,7 +187,7 @@ func LogLevel() slog.Level {
var (
// FlashAttention enables the experimental flash attention feature.
FlashAttention = Bool("OLLAMA_FLASH_ATTENTION")
FlashAttention = BoolWithDefault("OLLAMA_FLASH_ATTENTION")
// KvCacheType is the quantization type for the K/V cache.
KvCacheType = String("OLLAMA_KV_CACHE_TYPE")
// NoHistory disables readline history.
@@ -199,6 +220,7 @@ var (
CudaVisibleDevices = String("CUDA_VISIBLE_DEVICES")
HipVisibleDevices = String("HIP_VISIBLE_DEVICES")
RocrVisibleDevices = String("ROCR_VISIBLE_DEVICES")
VkVisibleDevices = String("GGML_VK_VISIBLE_DEVICES")
GpuDeviceOrdinal = String("GPU_DEVICE_ORDINAL")
HsaOverrideGfxVersion = String("HSA_OVERRIDE_GFX_VERSION")
)
@@ -252,7 +274,7 @@ type EnvVar struct {
func AsMap() map[string]EnvVar {
ret := map[string]EnvVar{
"OLLAMA_DEBUG": {"OLLAMA_DEBUG", LogLevel(), "Show additional debug information (e.g. OLLAMA_DEBUG=1)"},
"OLLAMA_FLASH_ATTENTION": {"OLLAMA_FLASH_ATTENTION", FlashAttention(), "Enabled flash attention"},
"OLLAMA_FLASH_ATTENTION": {"OLLAMA_FLASH_ATTENTION", FlashAttention(false), "Enabled flash attention"},
"OLLAMA_KV_CACHE_TYPE": {"OLLAMA_KV_CACHE_TYPE", KvCacheType(), "Quantization type for the K/V cache (default: f16)"},
"OLLAMA_GPU_OVERHEAD": {"OLLAMA_GPU_OVERHEAD", GpuOverhead(), "Reserve a portion of VRAM per GPU (bytes)"},
"OLLAMA_HOST": {"OLLAMA_HOST", Host(), "IP Address for the ollama server (default 127.0.0.1:11434)"},
@@ -270,6 +292,7 @@ func AsMap() map[string]EnvVar {
"OLLAMA_MULTIUSER_CACHE": {"OLLAMA_MULTIUSER_CACHE", MultiUserCache(), "Optimize prompt caching for multi-user scenarios"},
"OLLAMA_CONTEXT_LENGTH": {"OLLAMA_CONTEXT_LENGTH", ContextLength(), "Context length to use unless otherwise specified (default: 4096)"},
"OLLAMA_NEW_ENGINE": {"OLLAMA_NEW_ENGINE", NewEngine(), "Enable the new Ollama engine"},
"OLLAMA_REMOTES": {"OLLAMA_REMOTES", Remotes(), "Allowed hosts for remote models (default \"ollama.com\")"},
// Informational
"HTTP_PROXY": {"HTTP_PROXY", String("HTTP_PROXY")(), "HTTP proxy"},
@@ -288,6 +311,7 @@ func AsMap() map[string]EnvVar {
ret["CUDA_VISIBLE_DEVICES"] = EnvVar{"CUDA_VISIBLE_DEVICES", CudaVisibleDevices(), "Set which NVIDIA devices are visible"}
ret["HIP_VISIBLE_DEVICES"] = EnvVar{"HIP_VISIBLE_DEVICES", HipVisibleDevices(), "Set which AMD devices are visible by numeric ID"}
ret["ROCR_VISIBLE_DEVICES"] = EnvVar{"ROCR_VISIBLE_DEVICES", RocrVisibleDevices(), "Set which AMD devices are visible by UUID or numeric ID"}
ret["GGML_VK_VISIBLE_DEVICES"] = EnvVar{"GGML_VK_VISIBLE_DEVICES", VkVisibleDevices(), "Set which Vulkan devices are visible by numeric ID"}
ret["GPU_DEVICE_ORDINAL"] = EnvVar{"GPU_DEVICE_ORDINAL", GpuDeviceOrdinal(), "Set which AMD devices are visible by numeric ID"}
ret["HSA_OVERRIDE_GFX_VERSION"] = EnvVar{"HSA_OVERRIDE_GFX_VERSION", HsaOverrideGfxVersion(), "Override the gfx used for all detected AMD GPUs"}
ret["OLLAMA_INTEL_GPU"] = EnvVar{"OLLAMA_INTEL_GPU", IntelGPU(), "Enable experimental Intel GPU detection"}
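
The `BoolWithDefault` and `Remotes` changes above follow a small pattern: read an environment variable, and let the caller or a built-in value decide what an unset variable means. The sketch below reproduces that pattern in isolation. All names are hypothetical, it reads `os.Getenv` directly rather than the package's own `Var` helper, and because the diff does not show the parse-error branch, falling back to the default there is an assumption of this sketch rather than the project's behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseBool returns def when s is empty, otherwise the parsed boolean.
// ASSUMPTION: on a parse error this sketch returns def; the real error
// branch is elided in the diff above.
func parseBool(s string, def bool) bool {
	s = strings.TrimSpace(s)
	if s == "" {
		return def
	}
	b, err := strconv.ParseBool(s)
	if err != nil {
		return def
	}
	return b
}

// boolWithDefault defers the choice of default to the call site, so a
// caller can pass a model-dependent default.
func boolWithDefault(key string) func(def bool) bool {
	return func(def bool) bool { return parseBool(os.Getenv(key), def) }
}

// boolVar keeps the older behavior: an unset variable means false.
func boolVar(key string) func() bool {
	withDefault := boolWithDefault(key)
	return func() bool { return withDefault(false) }
}

// remotes splits a comma-separated host list, defaulting to ollama.com.
func remotes(raw string) []string {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return []string{"ollama.com"}
	}
	return strings.Split(raw, ",")
}

func main() {
	flashAttention := boolWithDefault("OLLAMA_FLASH_ATTENTION")
	fmt.Println(flashAttention(true)) // default applies only when the variable is unset
	fmt.Println(boolVar("OLLAMA_FLASH_ATTENTION")())
	fmt.Println(remotes(os.Getenv("OLLAMA_REMOTES")))
}
```

Splitting the default out of the closure is what lets `AsMap` call `FlashAttention(false)` for display while other callers supply a different default.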


@@ -37,6 +37,7 @@ func TestHost(t *testing.T) {
"https": {"https://1.2.3.4", "https://1.2.3.4:443"},
"https port": {"https://1.2.3.4:4321", "https://1.2.3.4:4321"},
"proxy path": {"https://example.com/ollama", "https://example.com:443/ollama"},
"ollama.com": {"ollama.com", "https://ollama.com:443"},
}
for name, tt := range cases {


@@ -1,14 +1,17 @@
package ggml
import (
"cmp"
"encoding/binary"
"errors"
"fmt"
"io"
"log/slog"
"math"
"slices"
"strings"
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/fs/util/bufioutil"
)
@@ -54,10 +57,28 @@ func (kv KV) EmbeddingLength() uint64 {
return uint64(kv.Uint("embedding_length"))
}
func (kv KV) HeadCount() []uint64 {
headCountDefault := uint32(1)
headCount := kv.UintOrArrayValueAsArray("attention.head_count", headCountDefault)
if len(headCount) == 1 {
headCountDefault = headCount[0]
}
nLayers := int(kv.BlockCount())
if len(headCount) > nLayers {
slog.Warn("got more elements of attention.head_count than layers", "len(headCount)", len(headCount), "layers", nLayers)
}
out := make([]uint64, nLayers)
for i := range nLayers {
if i >= len(headCount) {
out[i] = uint64(headCountDefault)
} else {
out[i] = uint64(headCount[i])
}
}
return out
}
func (kv KV) HeadCountMax() uint64 {
// TODO(drifkin): using the max value can cause an overestimation. In the
// future if array values become more popular, we can adapt the more invasive
// <https://github.com/ollama/ollama/pull/10225>
return uint64(kv.UintOrMaxArrayValue("attention.head_count", 1))
}
@@ -65,6 +86,27 @@ func (kv KV) HeadCountMin() uint64 {
return uint64(kv.UintOrMinArrayValue("attention.head_count", 1))
}
func (kv KV) HeadCountKV() []uint64 {
headCountKVDefault := uint32(1)
headCountKV := kv.UintOrArrayValueAsArray("attention.head_count_kv", headCountKVDefault)
if len(headCountKV) == 1 {
headCountKVDefault = headCountKV[0]
}
nLayers := int(kv.BlockCount())
if len(headCountKV) > nLayers {
slog.Warn("got more elements of attention.head_count_kv than layers", "len(headCountKV)", len(headCountKV), "layers", nLayers)
}
out := make([]uint64, nLayers)
for i := range nLayers {
if i >= len(headCountKV) {
out[i] = uint64(headCountKVDefault)
} else {
out[i] = uint64(headCountKV[i])
}
}
return out
}
func (kv KV) HeadCountKVMax() uint64 {
return uint64(kv.UintOrMaxArrayValue("attention.head_count_kv", 1))
}
@@ -97,6 +139,26 @@ func (kv KV) ChatTemplate() string {
return kv.String("tokenizer.chat_template")
}
// ssm architecture parameters
func (kv KV) SSMConvKernel() uint64 {
return uint64(kv.Uint("ssm.conv_kernel"))
}
func (kv KV) SSMInnerSize() uint64 {
return uint64(kv.Uint("ssm.inner_size"))
}
func (kv KV) SSMStateSize() uint64 {
return uint64(kv.Uint("ssm.state_size"))
}
func (kv KV) SSMGroupCount() uint64 {
return uint64(kv.Uint("ssm.group_count"))
}
// general types
func (kv KV) String(key string, defaultValue ...string) string {
val, _ := keyValue(kv, key, append(defaultValue, "")...)
return val
@@ -128,22 +190,27 @@ func (kv KV) UintOrMinArrayValue(key string, defaultValue uint32) uint32 {
}
func (kv KV) UintOrArrayValue(key string, defaultValue uint32) (uint32, uint32) {
arrVal := kv.UintOrArrayValueAsArray(key, defaultValue)
return slices.Min(arrVal), slices.Max(arrVal)
}
func (kv KV) UintOrArrayValueAsArray(key string, defaultValue uint32) []uint32 {
if u32, ok := keyValue(kv, key, uint32(0)); ok {
return u32, u32
return []uint32{u32}
} else if u32s, ok := keyValue(kv, key, &array[uint32]{}); ok {
min := slices.Min(u32s.values)
max := slices.Max(u32s.values)
return min, max
return u32s.values
} else if i32s, ok := keyValue(kv, key, &array[int32]{}); ok {
min := slices.Min(i32s.values)
max := slices.Max(i32s.values)
if min < 0 || max < 0 {
slog.Warn("array values are unexpectedly negative", "key", key, "min", min, "max", max)
dst := make([]uint32, len(i32s.values))
for i, v := range i32s.values {
if v < 0 {
slog.Warn("array values are unexpectedly negative", "key", key, "i", i, "v", v)
}
dst[i] = uint32(v)
}
return uint32(min), uint32(max)
return dst
}
return defaultValue, defaultValue
return []uint32{defaultValue}
}
func (kv KV) Strings(key string, defaultValue ...[]string) []string {
@@ -176,9 +243,12 @@ func (kv KV) OllamaEngineRequired() bool {
"gemma3",
"gemma3n",
"mistral3",
"qwen3",
"qwen3moe",
"llama4",
"mllama",
"qwen25vl",
"gptoss", "gpt-oss",
}, kv.Architecture())
}
@@ -273,36 +343,37 @@ type Tensor struct {
func (t Tensor) block() (n int) {
if _, err := fmt.Sscanf(t.Name, "blk.%d.", &n); err != nil {
return -1
return math.MaxInt
}
return
}
func (t Tensor) blockSize() uint64 {
return (TensorType)(t.Kind).BlockSize()
return TensorType(t.Kind).BlockSize()
}
func (t TensorType) BlockSize() uint64 {
switch t {
case
0, // F32
1, // F16
24, // I8
25, // I16
26, // I32
27, // I64
28, // F64
30: // BF16
TensorTypeF32,
TensorTypeF16,
TensorTypeI8,
TensorTypeI16,
TensorTypeI32,
TensorTypeI64,
TensorTypeF64,
TensorTypeBF16:
return 1
case
2, // Q4_0
3, // Q4_1
6, // Q5_0
7, // Q5_1
8, // Q8_0
9, // Q8_1
20: // IQ4_NL
TensorTypeQ4_0,
TensorTypeQ4_1,
TensorTypeQ5_0,
TensorTypeQ5_1,
TensorTypeQ8_0,
TensorTypeQ8_1,
tensorTypeIQ4_NL,
4, TensorTypeMXFP4:
return 32
default:
return 256
@@ -375,6 +446,8 @@ func (t TensorType) TypeSize() uint64 {
return blockSize/8 + blockSize/16 + blockSize/32
case TensorTypeBF16:
return 2
case 4, TensorTypeMXFP4:
return 1 + blockSize/2
default:
return 0
}
@@ -474,10 +547,14 @@ func Decode(rs io.ReadSeeker, maxArraySize int) (*GGML, error) {
}, nil
}
func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType string) (kv []uint64, partialOffload, fullOffload uint64) {
func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType string, useFlashAttention bool) (kv []uint64, partialOffload, fullOffload uint64) {
context *= uint64(numParallel)
embedding := f.KV().EmbeddingLength()
heads := f.KV().HeadCountMax()
headsArr := f.KV().HeadCount()
headsKV := f.KV().HeadCountKVMax()
headsKVArr := f.KV().HeadCountKV()
vocab := uint64(f.KV()["tokenizer.ggml.tokens"].(*array[string]).size)
embeddingHeads := f.KV().EmbeddingHeadCountMax()
@@ -487,10 +564,51 @@ func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType stri
layers := f.Tensors().GroupLayers()
bytesPerElement := kvCacheBytesPerElement(kvCacheType)
// Default for models unless special-cased below. These defaults mirror the
// cache usage in llama.cpp under the assumption that models without special
// cases below will use the llamarunner and caching will be handled by the
// llama.cpp layer.
//
// This also assumes that a layer without heads or headsKV set is recurrent
// which is usually the case. Some models (eg nemotronh) use "blocks" in
// place of layers where some are MLP blocks that don't have any cache.
// Models like this will need a special case below to be accurately
// estimated.
var kvTotal uint64
kv = make([]uint64, f.KV().BlockCount())
kvSizeAttn := uint64(0)
kvSizeRecurrent := uint64(0)
for i := range kv {
kv[i] = uint64(float64(context*(embeddingHeadsK+embeddingHeadsV)*headsKV) * bytesPerElement)
headsL := headsArr[i]
headsKVL := headsKVArr[i]
if headsL > 0 && headsKVL > 0 {
// full attention layer
// NOTE: Assumes uniform values for all attn layers
kv[i] = uint64(float64(context*(embeddingHeadsK+embeddingHeadsV)*headsKVL) * bytesPerElement)
kvSizeAttn += kv[i]
} else {
// recurrent layer
ssmDConv := f.KV().SSMConvKernel()
ssmDState := f.KV().SSMStateSize()
ssmDInner := f.KV().SSMInnerSize()
ssmNGroups := f.KV().SSMGroupCount()
nEmbdR := uint64(0)
if ssmDConv > 0 {
nEmbdR = (ssmDConv - 1) * (ssmDInner + 2*ssmNGroups*ssmDState)
}
nEmbdS := ssmDState * ssmDInner
// recurrent always uses F32 in llama.cpp backend
// https://github.com/ggml-org/llama.cpp/blob/master/src/llama-model.cpp#L18644
bytesPerElementRecurrent := kvCacheBytesPerElement("f32")
kv[i] = (nEmbdR + nEmbdS) * uint64(bytesPerElementRecurrent)
kvSizeRecurrent += kv[i]
}
kvTotal += kv[i]
}
slog.Debug("default cache size estimate", "attention MiB", float32(kvSizeAttn)/(1024.*1024.), "attention bytes", kvSizeAttn, "recurrent MiB", float32(kvSizeRecurrent)/(1024.*1024.), "recurrent bytes", kvSizeRecurrent)
switch f.KV().Architecture() {
case "llama", "llama4":
@@ -658,6 +776,22 @@ func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType stri
4*qkvBias.Shape[0],
)
}
case "gptoss", "gpt-oss":
kv = make([]uint64, f.KV().BlockCount())
for i := range kv {
kv[i] = uint64(float64((embeddingHeadsK+embeddingHeadsV)*headsKV) * bytesPerElement)
if i%2 == 0 {
kv[i] *= (uint64(numParallel)*4096 + batch)
} else {
kv[i] *= context
}
}
partialOffload = 2 * f.KV().HeadCountMax() / cmp.Or(f.KV().HeadCountKVMin(), 1) * kvTotal / 6
if useFlashAttention {
// rough estimate of graph size with flash attention on
partialOffload = (4*uint64(numParallel) + context>>10 + 110) * format.MebiByte
}
}
return
@@ -732,7 +866,11 @@ func (llm GGML) VisionGraphSize() (weights, graphSize uint64) {
// SupportsKVCacheType checks if the requested cache type is supported
func (f GGML) SupportsKVCacheType(cacheType string) bool {
return slices.Contains([]string{"f16", "q8_0", "q4_0"}, cacheType)
if cacheType == "" || cacheType == "f16" {
return true
}
return slices.Contains([]string{"q8_0", "q4_0"}, cacheType)
}
// SupportsFlashAttention checks if the model supports flash attention
@@ -742,12 +880,26 @@ func (f GGML) SupportsFlashAttention() bool {
return false
}
if arch := f.KV().Architecture(); slices.Contains([]string{"gemma2"}, arch) {
return false
}
// Check head counts match and are non-zero
headCountK := f.KV().EmbeddingHeadCountK()
headCountV := f.KV().EmbeddingHeadCountV()
return headCountK != 0 && headCountV != 0 && headCountK == headCountV
}
// FlashAttention checks if the model should enable flash attention
func (f GGML) FlashAttention() bool {
return slices.Contains([]string{
"gemma3",
"gptoss", "gpt-oss",
"qwen3",
"qwen3moe",
}, f.KV().String("general.architecture"))
}
// kvCacheBytesPerElement returns the number of bytes per element for a given KV cache type
func kvCacheBytesPerElement(cacheType string) float64 {
switch cacheType {
@@ -755,6 +907,8 @@ func kvCacheBytesPerElement(cacheType string) float64 {
return 1 // 1/2 of fp16
case "q4_0":
return 0.5 // 1/4 of fp16
case "f32":
return 4 // f32 (default for recurrent)
default:
return 2 // f16 (default)
}
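
The default cache estimate in `GraphSize` above can be checked by hand. The sketch below restates the two per-layer formulas as standalone functions and evaluates them. The function names and every numeric input are hypothetical values chosen for illustration, and the `q8_0` case label is inferred from the `// 1/2 of fp16` comment because that line is elided in the diff.

```go
package main

import "fmt"

// bytesPerElement mirrors kvCacheBytesPerElement as shown in the diff.
func bytesPerElement(cacheType string) float64 {
	switch cacheType {
	case "q8_0": // inferred label; the case line is elided above
		return 1
	case "q4_0":
		return 0.5
	case "f32":
		return 4
	default: // f16
		return 2
	}
}

// attentionLayerBytes: context * (headDimK + headDimV) * headsKV * bytes.
func attentionLayerBytes(context, headDimK, headDimV, headsKV uint64, cacheType string) uint64 {
	return uint64(float64(context*(headDimK+headDimV)*headsKV) * bytesPerElement(cacheType))
}

// recurrentLayerBytes: (convolution state + SSM state) elements, always f32.
func recurrentLayerBytes(dConv, dInner, dState, nGroups uint64) uint64 {
	var convState uint64
	if dConv > 0 {
		convState = (dConv - 1) * (dInner + 2*nGroups*dState)
	}
	ssmState := dState * dInner
	return (convState + ssmState) * uint64(bytesPerElement("f32"))
}

func main() {
	// Hypothetical attention layer: 8192 context, 128-wide K and V heads, 8 KV heads.
	fmt.Println(attentionLayerBytes(8192, 128, 128, 8, "f16"))  // 33554432 (32 MiB)
	fmt.Println(attentionLayerBytes(8192, 128, 128, 8, "q8_0")) // 16777216 (16 MiB)

	// Hypothetical recurrent layer: conv kernel 4, inner size 4096, state 128, 8 groups.
	fmt.Println(recurrentLayerBytes(4, 4096, 128, 8)) // 2170880
}
```

The attention term scales linearly with context length while the recurrent term does not depend on it, which is why the two are accumulated and logged separately.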


@@ -509,7 +509,10 @@ func writeGGUFArray[S ~[]E, E any](w io.Writer, t uint32, s S) error {
}
func WriteGGUF(f *os.File, kv KV, ts []*Tensor) error {
alignment := kv.Uint("general.alignment", 32)
arch := kv.String("general.architecture")
if arch == "" {
return fmt.Errorf("architecture not set")
}
if err := binary.Write(f, binary.LittleEndian, []byte("GGUF")); err != nil {
return err
@@ -528,17 +531,22 @@ func WriteGGUF(f *os.File, kv KV, ts []*Tensor) error {
}
for _, key := range slices.Sorted(maps.Keys(kv)) {
if err := ggufWriteKV(f, key, kv[key]); err != nil {
if err := ggufWriteKV(f, arch, key, kv[key]); err != nil {
return err
}
}
slices.SortStableFunc(ts, func(a, b *Tensor) int {
if i, j := a.block(), b.block(); i > 0 && j > 0 {
return cmp.Compare(i, j)
}
return cmp.Compare(a.Name, b.Name)
})
slices.SortStableFunc(
ts,
func(a, b *Tensor) int {
return cmp.Or(
cmp.Compare(a.block(), b.block()),
cmp.Compare(a.Name, b.Name),
)
},
)
alignment := kv.Uint("general.alignment", 32)
var s uint64
for i := range ts {
@@ -571,7 +579,14 @@ func WriteGGUF(f *os.File, kv KV, ts []*Tensor) error {
return g.Wait()
}
func ggufWriteKV(ws io.WriteSeeker, k string, v any) error {
func ggufWriteKV(ws io.WriteSeeker, arch, k string, v any) error {
if !strings.HasPrefix(k, arch+".") &&
!strings.HasPrefix(k, "general.") &&
!strings.HasPrefix(k, "adapter.") &&
!strings.HasPrefix(k, "tokenizer.") {
k = arch + "." + k
}
slog.Debug(k, "type", fmt.Sprintf("%T", v))
if err := binary.Write(ws, binary.LittleEndian, uint64(len(k))); err != nil {
return err
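
Two behaviors change in this file: tensors are ordered by block number and then by name, with non-block tensors sorted last, and keys without a recognized prefix gain the architecture prefix. The sketch below demonstrates both on plain strings. The helper names `blockOf`, `sortTensorNames`, and `qualifyKey` are hypothetical, and `cmp.Or` requires Go 1.22 or newer.

```go
package main

import (
	"cmp"
	"fmt"
	"math"
	"slices"
	"strings"
)

// blockOf extracts N from "blk.N.<rest>"; names outside a block map to
// math.MaxInt so they sort after every numbered block.
func blockOf(name string) int {
	var n int
	if _, err := fmt.Sscanf(name, "blk.%d.", &n); err != nil {
		return math.MaxInt
	}
	return n
}

// sortTensorNames orders by block number first, then by name.
func sortTensorNames(names []string) {
	slices.SortStableFunc(names, func(a, b string) int {
		return cmp.Or(
			cmp.Compare(blockOf(a), blockOf(b)),
			cmp.Compare(a, b),
		)
	})
}

// qualifyKey prepends "<arch>." unless the key already carries the
// architecture prefix or one of the shared prefixes.
func qualifyKey(arch, k string) string {
	for _, p := range []string{arch + ".", "general.", "adapter.", "tokenizer."} {
		if strings.HasPrefix(k, p) {
			return k
		}
	}
	return arch + "." + k
}

func main() {
	names := []string{
		"token_embd.weight",
		"blk.1.ffn_up.weight",
		"output.weight",
		"blk.0.ffn_norm.weight",
		"blk.0.attn_k.weight",
	}
	sortTensorNames(names)
	fmt.Println(names)

	fmt.Println(qualifyKey("test", "attention.key")) // test.attention.key
	fmt.Println(qualifyKey("test", "tokenizer.key")) // tokenizer.key
}
```

Returning `math.MaxInt` instead of `-1` for non-block tensors lets a single comparator handle both cases, so block 0 no longer needs special treatment.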


@@ -11,24 +11,24 @@ import (
)
func TestWriteGGUF(t *testing.T) {
r := rand.New(rand.NewPCG(0, 0))
b := bytes.NewBuffer(make([]byte, 2*3))
for range 8 {
t.Run("shuffle", func(t *testing.T) {
t.Parallel()
ts := []*Tensor{
{Name: "token_embd.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "blk.0.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "blk.1.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "blk.2.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "blk.3.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "blk.4.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "blk.5.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: bytes.NewBuffer(make([]byte, 2*3))},
{Name: "output_norm.weight", Shape: []uint64{3, 2}, WriterTo: bytes.NewBuffer(make([]byte, 3*2))},
{Name: "output.weight", Shape: []uint64{3, 2}, WriterTo: bytes.NewBuffer(make([]byte, 3*2))},
{Name: "token_embd.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "blk.0.ffn_norm.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "blk.0.attn_norm.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "blk.1.ffn_up.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "blk.2.ffn_norm.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "blk.1.ffn_down.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "blk.0.attn_k.weight", Shape: []uint64{2, 3}, WriterTo: b},
{Name: "output_norm.weight", Shape: []uint64{3, 2}, WriterTo: b},
{Name: "output.weight", Shape: []uint64{3, 2}, WriterTo: b},
}
r.Shuffle(len(ts), func(i, j int) {
rand.Shuffle(len(ts), func(i, j int) {
ts[i], ts[j] = ts[j], ts[i]
})
@@ -39,7 +39,12 @@ func TestWriteGGUF(t *testing.T) {
defer w.Close()
if err := WriteGGUF(w, KV{
"general.alignment": uint32(16),
"general.architecture": "test",
"general.alignment": uint32(16),
"test.key": "value",
"attention.key": "value2",
"tokenizer.key": "value3",
"adapter.key": "value4",
}, ts); err != nil {
t.Fatal(err)
}
@@ -56,21 +61,26 @@ func TestWriteGGUF(t *testing.T) {
}
if diff := cmp.Diff(KV{
"general.architecture": "test",
"general.alignment": uint32(16),
"general.parameter_count": uint64(54),
"test.key": "value",
"test.attention.key": "value2",
"tokenizer.key": "value3",
"adapter.key": "value4",
}, ff.KV()); diff != "" {
t.Errorf("Mismatch (-want +got):\n%s", diff)
}
if diff := cmp.Diff(Tensors{
Offset: 608,
Offset: 800,
items: []*Tensor{
{Name: "blk.0.attn_norm.weight", Offset: 0, Shape: []uint64{2, 3}},
{Name: "blk.1.attn_norm.weight", Offset: 32, Shape: []uint64{2, 3}},
{Name: "blk.2.attn_norm.weight", Offset: 64, Shape: []uint64{2, 3}},
{Name: "blk.3.attn_norm.weight", Offset: 96, Shape: []uint64{2, 3}},
{Name: "blk.4.attn_norm.weight", Offset: 128, Shape: []uint64{2, 3}},
{Name: "blk.5.attn_norm.weight", Offset: 160, Shape: []uint64{2, 3}},
{Name: "blk.0.attn_k.weight", Offset: 0, Shape: []uint64{2, 3}},
{Name: "blk.0.attn_norm.weight", Offset: 32, Shape: []uint64{2, 3}},
{Name: "blk.0.ffn_norm.weight", Offset: 64, Shape: []uint64{2, 3}},
{Name: "blk.1.ffn_down.weight", Offset: 96, Shape: []uint64{2, 3}},
{Name: "blk.1.ffn_up.weight", Offset: 128, Shape: []uint64{2, 3}},
{Name: "blk.2.ffn_norm.weight", Offset: 160, Shape: []uint64{2, 3}},
{Name: "output.weight", Offset: 192, Shape: []uint64{3, 2}},
{Name: "output_norm.weight", Offset: 224, Shape: []uint64{3, 2}},
{Name: "token_embd.weight", Offset: 256, Shape: []uint64{2, 3}},
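The expected offsets in this hunk advance in steps of 32 even though each tensor has only six elements. A minimal sketch of the padding arithmetic that would produce those numbers is below. It assumes each element is a 4-byte F32 (so 24 bytes per tensor) and that every tensor is padded up to the `general.alignment` value of 16. The `padding` and `offsets` helpers are hypothetical names for illustration, not the functions used by `WriteGGUF`.

```go
package main

import "fmt"

// padding returns how many bytes must follow a block of the given size so
// that the next block starts on a multiple of align.
func padding(size, align uint64) uint64 {
	return (align - size%align) % align
}

// offsets lays blocks out back to back, padding each one to the alignment,
// and returns the starting offset of every block.
func offsets(sizes []uint64, align uint64) []uint64 {
	out := make([]uint64, len(sizes))
	var next uint64
	for i, size := range sizes {
		out[i] = next
		next += size + padding(size, align)
	}
	return out
}

func main() {
	// Assumption: a 2x3 tensor holds 6 four-byte elements, i.e. 24 bytes.
	sizes := []uint64{24, 24, 24}
	fmt.Println(offsets(sizes, 16)) // prints [0 32 64]
}
```

Under the same assumption, the expected `general.parameter_count` of 54 is simply nine tensors times six elements.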

@@ -14,9 +14,9 @@ const (
FileTypeF16
fileTypeQ4_0
fileTypeQ4_1
fileTypeQ4_1_F16 // unused by GGML
fileTypeQ4_2 // unused by GGML
fileTypeQ4_3 // unused by GGML
fileTypeMXFP4 // originally fileTypeQ4_1_F16 // unused by GGML
fileTypeQ4_2 // unused by GGML
fileTypeQ4_3 // unused by GGML
FileTypeQ8_0
fileTypeQ5_0
fileTypeQ5_1
@@ -97,6 +97,8 @@ func (t FileType) String() string {
return "Q4_0"
case fileTypeQ4_1:
return "Q4_1"
case fileTypeMXFP4:
return "MXFP4"
case FileTypeQ8_0:
return "Q8_0"
case fileTypeQ5_0:
@@ -172,6 +174,8 @@ func (ftype FileType) ToTensorType() TensorType {
return TensorTypeQ2_K
case FileTypeBF16:
return TensorTypeBF16
case fileTypeMXFP4:
return TensorTypeMXFP4
default:
slog.Warn("unsupported file type", "type", ftype)
return 0 // F32
@@ -187,7 +191,7 @@ const (
TensorTypeF16
TensorTypeQ4_0
TensorTypeQ4_1
tensorTypeQ4_2 // unused by GGML
tensorTypeQ4_2
tensorTypeQ4_3 // unused by GGML
TensorTypeQ5_0
TensorTypeQ5_1
@@ -222,9 +226,10 @@ const (
tensorTypeIQ4_NL_4_4 // unused by GGML
tensorTypeIQ4_NL_4_8 // unused by GGML
tensorTypeIQ4_NL_8_8 // unused by GGML
TensorTypeMXFP4
)
// ParseFileType parses the provided GGUF file type
// ParseTensorType parses the provided GGUF tensor type
// Only Ollama supported types are considered valid
func ParseTensorType(s string) (TensorType, error) {
switch s {
@@ -260,6 +265,8 @@ func ParseTensorType(s string) (TensorType, error) {
return TensorTypeF64, nil
case "BF16":
return TensorTypeBF16, nil
case "MXFP4":
return TensorTypeMXFP4, nil
default:
return 0, fmt.Errorf("unsupported quantization type %s", s)
}
@@ -312,6 +319,8 @@ func (t TensorType) String() string {
return "F64"
case TensorTypeBF16:
return "BF16"
case 4, TensorTypeMXFP4:
return "MXFP4"
default:
return "unknown"
}

harmony/harmonyparser.go

@@ -0,0 +1,544 @@
package harmony
import (
"encoding/json"
"fmt"
"log/slog"
"strings"
"unicode"
"github.com/ollama/ollama/api"
"github.com/ollama/ollama/logutil"
)
type harmonyParserState int
const (
harmonyParserState_LookingForMessageStart harmonyParserState = iota
harmonyParserState_ParsingHeader
harmonyParserState_ParsingContent
)
func (s harmonyParserState) String() string {
switch s {
// we're looking for the message start tag
case harmonyParserState_LookingForMessageStart:
return "LookingForMessageStart"
case harmonyParserState_ParsingHeader:
return "ParsingHeader"
case harmonyParserState_ParsingContent:
return "ParsingContent"
default:
return "Unknown"
}
}
type HarmonyParser struct {
state harmonyParserState
MessageStartTag string
MessageEndTag string
HeaderEndTag string
acc strings.Builder
lifetimeAcc strings.Builder
}
type HarmonyEvent interface {
isHarmonyEvent()
}
type HarmonyEventMessageStart struct{}
func (HarmonyEventMessageStart) isHarmonyEvent() {}
type HarmonyEventHeaderComplete struct {
Header HarmonyHeader
}
func (HarmonyEventHeaderComplete) isHarmonyEvent() {}
type HarmonyEventContentEmitted struct {
Content string
}
func (HarmonyEventContentEmitted) isHarmonyEvent() {}
type HarmonyEventMessageEnd struct{}
func (HarmonyEventMessageEnd) isHarmonyEvent() {}
type HarmonyHeader struct {
Role string
Channel string
Recipient string
}
func (s *HarmonyParser) AddImplicitStart() {
s.acc.WriteString("<|start|>assistant")
}
func (s *HarmonyParser) AddImplicitStartOrPrefill(lastMessage *api.Message) {
if lastMessage != nil && lastMessage.Role == "assistant" {
// handle prefilling conditions
if lastMessage.Content != "" {
s.acc.WriteString("<|start|>assistant<|channel|>final<|message|>")
return
} else if lastMessage.Thinking != "" {
s.acc.WriteString("<|start|>assistant<|channel|>analysis<|message|>")
return
}
}
s.AddImplicitStart()
}
func (s *HarmonyParser) AddContent(content string) []HarmonyEvent {
s.lifetimeAcc.WriteString(content)
s.acc.WriteString(content)
var events []HarmonyEvent
keepLooping := true
// we loop because we might pass through multiple parsing states in a single
// call to AddContent, and we want to make sure callers don't have to wait for
// data that's already unambiguous
for keepLooping {
var newEvents []HarmonyEvent
newEvents, keepLooping = eat(s)
events = append(events, newEvents...)
}
return events
}
// the additional bool return is true iff we should continue eating
func eat(s *HarmonyParser) ([]HarmonyEvent, bool) {
switch s.state {
case harmonyParserState_LookingForMessageStart:
// does the acc contain the message start tag?
if strings.Contains(s.acc.String(), s.MessageStartTag) {
// split the acc into the message start tag and the rest
split := strings.SplitN(s.acc.String(), s.MessageStartTag, 2)
before := split[0]
if before != "" {
slog.Warn("harmony parser: found message start tag in the middle of the content", "content", s.acc.String())
}
after := split[1]
s.acc.Reset()
s.acc.WriteString(after)
s.state = harmonyParserState_ParsingHeader
return []HarmonyEvent{HarmonyEventMessageStart{}}, true
}
// no match, so we keep accumulating
return nil, false
case harmonyParserState_ParsingHeader:
if strings.Contains(s.acc.String(), s.HeaderEndTag) {
split := strings.SplitN(s.acc.String(), s.HeaderEndTag, 2)
header := split[0]
after := split[1]
s.acc.Reset()
s.acc.WriteString(after)
s.state = harmonyParserState_ParsingContent
return []HarmonyEvent{HarmonyEventHeaderComplete{Header: s.parseHeader(header)}}, true
}
return nil, false
case harmonyParserState_ParsingContent:
if strings.Contains(s.acc.String(), s.MessageEndTag) {
// if we already have the message end tag, we can emit the content up to it
split := strings.SplitN(s.acc.String(), s.MessageEndTag, 2)
content := split[0]
after := split[1]
s.acc.Reset()
s.acc.WriteString(after)
s.state = harmonyParserState_LookingForMessageStart
events := []HarmonyEvent{}
if content != "" {
events = append(events, HarmonyEventContentEmitted{Content: content})
}
events = append(events, HarmonyEventMessageEnd{})
return events, true
} else if overlapLen := overlap(s.acc.String(), s.MessageEndTag); overlapLen > 0 {
// if our suffix contains the start of the message end tag, we can emit
// the content up to the start of the message end tag
content := s.acc.String()[:len(s.acc.String())-overlapLen]
remaining := s.acc.String()[len(s.acc.String())-overlapLen:]
s.acc.Reset()
s.acc.WriteString(remaining)
// emit the content we know isn't part of the message end tag, and keep
// accumulating to disambiguate the rest
if content == "" {
return nil, false
}
return []HarmonyEvent{HarmonyEventContentEmitted{Content: content}}, false
} else {
// no end tag, so it's still normal content that we can immediately emit
content := s.acc.String()
if content == "" {
return nil, false
}
s.acc.Reset()
return []HarmonyEvent{HarmonyEventContentEmitted{Content: content}}, false
}
}
return nil, false
}
func (s *HarmonyParser) parseHeader(raw string) HarmonyHeader {
harmonyHeader := HarmonyHeader{}
// if `<|constrain|>` is present, ensure it has a space before it so it gets
// parsed as a separate token, even if the model didn't include the space
if strings.Contains(raw, "<|constrain|>") {
raw = strings.Replace(raw, "<|constrain|>", " <|constrain|>", 1)
raw = strings.TrimSpace(raw)
}
// look for the optional channel tag, which is `<|channel|>` followed by the
// channel name, all without any whitespace
channelIndex := strings.Index(raw, "<|channel|>")
if channelIndex != -1 {
before := raw[:channelIndex]
after := raw[channelIndex+len("<|channel|>"):]
// the channel name is `after` all the way up to the first (if any) whitespace character
idx := strings.IndexFunc(after, func(r rune) bool {
return unicode.IsSpace(r)
})
if idx == -1 {
idx = len(after)
}
harmonyHeader.Channel = after[:idx]
after = after[idx:]
// now we remove the channel tag from the raw string to further process
raw = before + after
raw = strings.TrimSpace(raw)
}
// split the header into whitespace-separated tokens
tokens := strings.Fields(raw)
// the first token is treated as the role
if len(tokens) == 0 {
slog.Error("harmony parser: missing role in header", "header", raw)
return harmonyHeader
}
role := tokens[0]
tokens = tokens[1:]
// special case: if role starts with to= then it's a tool call
if strings.HasPrefix(role, "to=") {
harmonyHeader.Recipient = role[3:]
harmonyHeader.Role = "tool"
} else {
harmonyHeader.Role = role
}
// the recipient (if any) can be specified before or after the channel tag, so
// we check it at the end once we've already parsed the channel and role
if harmonyHeader.Recipient == "" && len(tokens) > 0 && strings.HasPrefix(tokens[0], "to=") {
harmonyHeader.Recipient = tokens[0][3:]
}
return harmonyHeader
}
// longest overlap between suffix of s and prefix of delim
func overlap(s, delim string) int {
max := min(len(delim), len(s))
for i := max; i > 0; i-- {
if strings.HasSuffix(s, delim[:i]) {
return i
}
}
return 0
}
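The suffix/prefix overlap is what lets the content state stream text immediately while holding back only the bytes that might still turn out to be an end tag. The sketch below reproduces `overlap` so it can run on its own and adds a hypothetical `splitSafe` helper (not part of this change) that shows the split applied in the `ParsingContent` branch.

```go
package main

import (
	"fmt"
	"strings"
)

// overlap returns the length of the longest suffix of s that is also a
// prefix of delim (same logic as the helper in this file).
func overlap(s, delim string) int {
	for i := min(len(delim), len(s)); i > 0; i-- {
		if strings.HasSuffix(s, delim[:i]) {
			return i
		}
	}
	return 0
}

// splitSafe is a hypothetical helper: it splits buf into the part that can
// be emitted right away and the tail that must be held back because it may
// be the beginning of delim.
func splitSafe(buf, delim string) (emit, hold string) {
	n := overlap(buf, delim)
	return buf[:len(buf)-n], buf[len(buf)-n:]
}

func main() {
	emit, hold := splitSafe("test<|e", "<|end|>")
	fmt.Printf("emit=%q hold=%q\n", emit, hold) // prints emit="test" hold="<|e"
}
```

If the next chunk is `xample|>more`, the held-back `<|e` no longer overlaps the end tag and is released as ordinary content, which is the behavior the streaming tests exercise.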
// harmonyMessageState represents the current state of message processing
type harmonyMessageState int
const (
harmonyMessageState_Normal harmonyMessageState = iota
harmonyMessageState_Thinking
harmonyMessageState_ToolCalling
)
// HarmonyMessageHandler processes harmony events and accumulates content appropriately.
// This is a higher level interface that maps harmony concepts into ollama concepts
type HarmonyMessageHandler struct {
state harmonyMessageState
HarmonyParser *HarmonyParser
FunctionNameMap *FunctionNameMap
toolAccumulator *HarmonyToolCallAccumulator
convertedTools map[string]struct{}
}
// NewHarmonyMessageHandler creates a new message handler
func NewHarmonyMessageHandler() *HarmonyMessageHandler {
return &HarmonyMessageHandler{
state: harmonyMessageState_Normal,
HarmonyParser: &HarmonyParser{
MessageStartTag: "<|start|>",
MessageEndTag: "<|end|>",
HeaderEndTag: "<|message|>",
},
FunctionNameMap: NewFunctionNameMap(),
convertedTools: make(map[string]struct{}),
}
}
// AddContent processes the content and returns the content, thinking, and tool content.
// content and thinking are already fully parsed, but tool content still needs to be passed to the tool parser
func (h *HarmonyMessageHandler) AddContent(content string, toolParser *HarmonyToolCallAccumulator) (string, string, string) {
contentSb := strings.Builder{}
thinkingSb := strings.Builder{}
toolContentSb := strings.Builder{}
events := h.HarmonyParser.AddContent(content)
for _, event := range events {
switch event := event.(type) {
case HarmonyEventHeaderComplete:
logutil.Trace("harmony event header complete", "header", event.Header)
switch event.Header.Channel {
case "analysis":
if event.Header.Recipient != "" {
h.state = harmonyMessageState_ToolCalling
// event.Header.Recipient is the tool name, something like
// "browser.search" for a built-in, or "functions.calc" for a
// custom one
toolParser.SetToolName(event.Header.Recipient)
} else {
h.state = harmonyMessageState_Thinking
}
case "commentary":
if event.Header.Recipient != "" {
h.state = harmonyMessageState_ToolCalling
toolParser.SetToolName(event.Header.Recipient)
} else {
h.state = harmonyMessageState_Normal
}
case "final":
h.state = harmonyMessageState_Normal
}
case HarmonyEventContentEmitted:
logutil.Trace("harmony event content", "content", event.Content, "state", h.state)
if h.state == harmonyMessageState_Normal {
contentSb.WriteString(event.Content)
} else if h.state == harmonyMessageState_Thinking {
thinkingSb.WriteString(event.Content)
} else if h.state == harmonyMessageState_ToolCalling {
toolContentSb.WriteString(event.Content)
}
case HarmonyEventMessageEnd:
h.state = harmonyMessageState_Normal
}
}
return contentSb.String(), thinkingSb.String(), toolContentSb.String()
}
func (h *HarmonyMessageHandler) CreateToolParser() *HarmonyToolCallAccumulator {
return &HarmonyToolCallAccumulator{
state: harmonyToolCallState_Normal,
currentToolName: nil,
}
}
type harmonyToolCallState int
const (
harmonyToolCallState_Normal harmonyToolCallState = iota
harmonyToolCallState_ToolCalling
)
type HarmonyToolCallAccumulator struct {
state harmonyToolCallState
acc strings.Builder
currentToolName *string
}
func (a *HarmonyToolCallAccumulator) SetToolName(toolName string) {
a.currentToolName = &toolName
}
func (a *HarmonyToolCallAccumulator) Add(content string) {
a.acc.WriteString(content)
}
func (a *HarmonyToolCallAccumulator) Drain() (*string, string) {
str := a.acc.String()
a.state = harmonyToolCallState_Normal
a.acc.Reset()
return a.currentToolName, str
}
func (a *HarmonyToolCallAccumulator) Content() string {
return a.acc.String()
}
// FunctionNameMap maps user-specified function names to valid harmony function
// names (which look like TypeScript identifiers). This is needed because
// user-specified function names might contain characters that are not allowed
// in TypeScript identifiers
type FunctionNameMap struct {
userToHarmony map[string]string
harmonyToUser map[string]string
}
func NewFunctionNameMap() *FunctionNameMap {
return &FunctionNameMap{
userToHarmony: make(map[string]string),
harmonyToUser: make(map[string]string),
}
}
// Init initializes the handler with tools and optional last message
// Implements the Parser interface
func (h *HarmonyMessageHandler) Init(tools []api.Tool, lastMessage *api.Message) []api.Tool {
// Initialize the harmony parser
if h.HarmonyParser == nil {
h.HarmonyParser = &HarmonyParser{
MessageStartTag: "<|start|>",
MessageEndTag: "<|end|>",
HeaderEndTag: "<|message|>",
}
}
// Handle prefill for chat mode
if lastMessage != nil {
h.HarmonyParser.AddImplicitStartOrPrefill(lastMessage)
} else {
h.HarmonyParser.AddImplicitStart()
}
// Initialize tool accumulator
h.toolAccumulator = h.CreateToolParser()
// Process tools and return renamed versions
if len(tools) == 0 {
return tools
}
processedTools := make([]api.Tool, len(tools))
copy(processedTools, tools)
for i, tool := range processedTools {
if tool.Function.Name != "" {
processedTools[i].Function.Name = h.FunctionNameMap.ConvertAndAdd(tool.Function.Name)
h.convertedTools[tool.Function.Name] = struct{}{}
}
}
return processedTools
}
// Add implements the Parser interface - processes streamed content and extracts content, thinking, and tool calls
func (h *HarmonyMessageHandler) Add(s string, done bool) (content string, thinking string, calls []api.ToolCall, err error) {
content, thinking, toolContent := h.AddContent(s, h.toolAccumulator)
if toolContent != "" {
h.toolAccumulator.Add(toolContent)
}
// tool calls always happen one at a time, and always at the end of a message,
// so for simplicity we defer parsing them until we know we're done
if done {
toolName, raw := h.toolAccumulator.Drain()
if toolName != nil {
name := strings.TrimPrefix(*toolName, "functions.")
name = h.FunctionNameMap.OriginalFromConverted(name)
var args api.ToolCallFunctionArguments
if err := json.Unmarshal([]byte(raw), &args); err != nil {
return "", "", nil, fmt.Errorf("error parsing tool call: raw='%s', err=%w", raw, err)
}
calls = append(calls, api.ToolCall{Function: api.ToolCallFunction{Name: name, Arguments: args}})
}
}
return content, thinking, calls, nil
}
// HasToolSupport implements the Parser interface
func (h *HarmonyMessageHandler) HasToolSupport() bool {
return true
}
// HasThinkingSupport implements the Parser interface
func (h *HarmonyMessageHandler) HasThinkingSupport() bool {
return true
}
func (m *FunctionNameMap) ConvertAndAdd(userFunctionName string) string {
harmonyFunctionName := m.deriveName(userFunctionName)
// built-in functions should not be renamed
if userFunctionName == "browser.open" || userFunctionName == "browser.search" || userFunctionName == "browser.find" || userFunctionName == "python" {
harmonyFunctionName = userFunctionName
}
m.userToHarmony[userFunctionName] = harmonyFunctionName
m.harmonyToUser[harmonyFunctionName] = userFunctionName
return harmonyFunctionName
}
// OriginalFromConverted looks up the reverse-mapping of a previously-converted
// user->harmony function name. To unmap reliably, the mapping must exist, as
// the conversion process is not reversible without the appropriate state
func (m *FunctionNameMap) OriginalFromConverted(harmonyFunctionName string) string {
if userFunctionName, ok := m.harmonyToUser[harmonyFunctionName]; ok {
return userFunctionName
}
slog.Warn("harmony parser: no reverse mapping found for function name", "harmonyFunctionName", harmonyFunctionName)
// fallback to the original function name if we can't find a mapping
return harmonyFunctionName
}
// convertToValidChars converts a user-specified function name to a valid
// TypeScript identifier.
//
// Limitations:
//
// - This doesn't restrict reserved TypeScript keywords.
// - We don't perform a real ID_Start/ID_Continue check, and instead use the more
// restrictive unicode.IsLetter/unicode.IsDigit check. Unclear what kind of
// identifiers these models were trained on, so in the end we might want to
// convert unicode-heavy identifiers to their closest ASCII equivalents.
func (m *FunctionNameMap) convertToValidChars(userFunctionName string) string {
mapper := func(r rune) rune {
// first, replace certain characters with underscores
if r == ' ' || r == '-' || r == '.' {
return '_'
}
if unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_' || r == '$' {
return r
}
// finally, remove any other characters
return -1
}
candidate := strings.Map(mapper, userFunctionName)
// set a default name if we end up with nothing left
if candidate == "" {
return "unnamed"
}
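// if the candidate starts with a digit, prepend an underscore to make it a
// valid identifier (decode the first rune rather than the first byte so that
// multi-byte unicode digits are detected too)
if unicode.IsDigit([]rune(candidate)[0]) {
candidate = "_" + candidate
}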
return candidate
}
func (m *FunctionNameMap) deriveName(userFunctionName string) string {
originalCandidate := m.convertToValidChars(userFunctionName)
candidate := originalCandidate
// Check for dupes, and if so, add a number to the end.
// We start at 2 because if we have dupes and the first is never renamed, it
// makes sense for them to be named, say, `f`, `f_2`, `f_3`
count := 2
for {
if _, exists := m.harmonyToUser[candidate]; !exists {
break
}
candidate = fmt.Sprintf("%s_%d", originalCandidate, count)
count++
}
return candidate
}
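The renaming scheme can be hard to picture from the two helpers alone, so here is a compact, self-contained sketch of the same idea: sanitize the name, then append `_2`, `_3`, ... until the result is unused, while remembering the reverse mapping. The `sanitize` function and `nameMap` type are simplified, hypothetical stand-ins for `convertToValidChars` and `FunctionNameMap`; they omit the leading-digit rule and the built-in tool exceptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitize is a simplified stand-in for convertToValidChars: separators
// become underscores and other non-identifier runes are dropped.
func sanitize(name string) string {
	out := strings.Map(func(r rune) rune {
		switch {
		case r == ' ' || r == '-' || r == '.':
			return '_'
		case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_' || r == '$':
			return r
		}
		return -1 // drop everything else
	}, name)
	if out == "" {
		return "unnamed"
	}
	return out
}

// nameMap (hypothetical) records which user name each converted name came from.
type nameMap struct {
	convertedToUser map[string]string
}

// add converts user into an unused identifier and stores the reverse mapping.
func (m *nameMap) add(user string) string {
	base := sanitize(user)
	candidate := base
	for n := 2; ; n++ {
		if _, taken := m.convertedToUser[candidate]; !taken {
			break
		}
		candidate = fmt.Sprintf("%s_%d", base, n)
	}
	m.convertedToUser[candidate] = user
	return candidate
}

func main() {
	m := &nameMap{convertedToUser: map[string]string{}}
	for _, user := range []string{"get weather", "get_weather", "get-weather"} {
		fmt.Printf("%q -> %q\n", user, m.add(user))
	}
	// The reverse lookup recovers the original name for a tool call.
	fmt.Println(m.convertedToUser["get_weather_3"]) // prints get-weather
}
```

Because three different user names collapse to the same identifier, the reverse map is the only way to recover which tool the model meant, which is why `OriginalFromConverted` depends on stored state.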

@@ -0,0 +1,538 @@
package harmony
import (
"fmt"
"reflect"
"testing"
)
func TestHeaderParsing(t *testing.T) {
tests := []struct {
in, wantRole, wantChannel, wantRecipient string
}{
{
in: "assistant<|channel|>analysis",
wantRole: "assistant",
wantChannel: "analysis",
wantRecipient: "",
},
{
in: "assistant<|channel|>analysis to=functions.get_weather",
wantRole: "assistant",
wantChannel: "analysis",
wantRecipient: "functions.get_weather",
},
{
in: "assistant to=functions.get_weather<|channel|>analysis",
wantRole: "assistant",
wantChannel: "analysis",
wantRecipient: "functions.get_weather",
},
// special case where the role is replaced by the recipient (matches reference code)
{
in: "to=functions.get_weather<|channel|>analysis",
wantRole: "tool",
wantChannel: "analysis",
wantRecipient: "functions.get_weather",
},
// extra token after the recipient is ignored
{
in: "assistant to=functions.get_weather abc<|channel|>analysis",
wantRole: "assistant",
wantChannel: "analysis",
wantRecipient: "functions.get_weather",
},
// with constrain tag, recipient after channel tag
{
in: "assistant<|channel|>commentary to=functions.get_weather <|constrain|>json",
wantRole: "assistant",
wantChannel: "commentary",
wantRecipient: "functions.get_weather",
},
// with constrain tag, recipient before channel tag
{
in: "assistant to=functions.get_weather<|channel|>commentary <|constrain|>json",
wantRole: "assistant",
wantChannel: "commentary",
wantRecipient: "functions.get_weather",
},
// constrain tag without space
{
in: "assistant<|channel|>commentary to=functions.get_weather<|constrain|>json",
wantRole: "assistant",
wantChannel: "commentary",
wantRecipient: "functions.get_weather",
},
// constrain tag without space, different order
{
in: "assistant to=functions.get_weather<|channel|>commentary<|constrain|>json",
wantRole: "assistant",
wantChannel: "commentary",
wantRecipient: "functions.get_weather",
},
}
for i, tt := range tests {
parser := HarmonyParser{
MessageStartTag: "<|start|>",
MessageEndTag: "<|end|>",
HeaderEndTag: "<|message|>",
}
header := parser.parseHeader(tt.in)
if header.Role != tt.wantRole {
t.Errorf("case %d: got role \"%s\", want \"%s\"", i, header.Role, tt.wantRole)
}
if header.Channel != tt.wantChannel {
t.Errorf("case %d: got channel \"%s\", want \"%s\"", i, header.Channel, tt.wantChannel)
}
if header.Recipient != tt.wantRecipient {
t.Errorf("case %d: got recipient \"%s\", want \"%s\"", i, header.Recipient, tt.wantRecipient)
}
}
}
func TestHarmonyParserHeaderEvent(t *testing.T) {
tests := []struct {
in, wantRole, wantChannel, wantRecipient string
implicitStart bool
}{
{
in: "<|start|>user<|message|>What is 2 + 2?<|end|>",
wantRole: "user",
wantChannel: "",
wantRecipient: "",
},
{
in: "<|start|>assistant<|channel|>analysis<|message|>What is 2 + 2?<|end|>",
wantRole: "assistant",
wantChannel: "analysis",
wantRecipient: "",
},
{
in: "<|start|>assistant<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{\"location\":\"San Francisco\"}<|call|><|start|>functions.get_weather to=assistant<|message|>{\"sunny\": true, \"temperature\": 20}<|end|>",
wantRole: "assistant",
wantChannel: "commentary",
wantRecipient: "functions.get_weather",
},
{
in: "<|channel|>analysis<|message|>User asks weather in SF. We need location. Use get_current_weather with location \"San Francisco, CA\".<|end|><|start|>assistant<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{\"location\":\"San Francisco, CA\"}<|call|>",
wantRole: "assistant",
wantChannel: "analysis",
wantRecipient: "",
implicitStart: true,
},
}
for i, tt := range tests {
parser := HarmonyParser{
MessageStartTag: "<|start|>",
MessageEndTag: "<|end|>",
HeaderEndTag: "<|message|>",
}
if tt.implicitStart {
parser.AddImplicitStart()
}
gotEvents := parser.AddContent(tt.in)
if len(gotEvents) == 0 {
t.Errorf("case %d: got no events, want at least one", i)
}
var firstHeaderEvent *HarmonyEventHeaderComplete
// print events
for _, event := range gotEvents {
fmt.Printf("event: %+v\n", event)
}
for _, event := range gotEvents {
if event, ok := event.(HarmonyEventHeaderComplete); ok {
firstHeaderEvent = &event
break
}
}
if firstHeaderEvent == nil {
t.Errorf("case %d: got no header complete event, want one", i)
continue
}
gotHeader := firstHeaderEvent.Header
if gotHeader.Role != tt.wantRole || gotHeader.Channel != tt.wantChannel || gotHeader.Recipient != tt.wantRecipient {
t.Errorf("case %d: got header %+v, want role=%s channel=%s recipient=%s", i, gotHeader, tt.wantRole, tt.wantChannel, tt.wantRecipient)
}
}
}
func TestHarmonyParserNonStreaming(t *testing.T) {
tests := []struct {
in string
implicitStart bool
wantEvents []HarmonyEvent
}{
{
in: "<|start|>user<|message|>What is 2 + 2?<|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "user", Channel: "", Recipient: ""}},
HarmonyEventContentEmitted{Content: "What is 2 + 2?"},
HarmonyEventMessageEnd{},
},
},
{
in: "<|start|>assistant<|channel|>analysis<|message|>The answer is 4<|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "analysis", Recipient: ""}},
HarmonyEventContentEmitted{Content: "The answer is 4"},
HarmonyEventMessageEnd{},
},
},
{
in: "<|start|>assistant<|channel|>commentary to=functions.calc<|message|>Computing...<|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "commentary", Recipient: "functions.calc"}},
HarmonyEventContentEmitted{Content: "Computing..."},
HarmonyEventMessageEnd{},
},
},
{
in: "<|start|>user<|message|><|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "user", Channel: "", Recipient: ""}},
HarmonyEventMessageEnd{},
},
},
{
in: "<|start|>user<|message|>Hello<|end|><|start|>assistant<|message|>Hi!<|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "user", Channel: "", Recipient: ""}},
HarmonyEventContentEmitted{Content: "Hello"},
HarmonyEventMessageEnd{},
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "", Recipient: ""}},
HarmonyEventContentEmitted{Content: "Hi!"},
HarmonyEventMessageEnd{},
},
},
{
in: "<|channel|>analysis<|message|>Thinking about the request<|end|>",
implicitStart: true,
wantEvents: []HarmonyEvent{HarmonyEventMessageStart{}, HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "analysis", Recipient: ""}}, HarmonyEventContentEmitted{Content: "Thinking about the request"}, HarmonyEventMessageEnd{}},
},
}
for i, tt := range tests {
parser := HarmonyParser{
MessageStartTag: "<|start|>",
MessageEndTag: "<|end|>",
HeaderEndTag: "<|message|>",
}
if tt.implicitStart {
parser.AddImplicitStart()
}
gotEvents := parser.AddContent(tt.in)
if !reflect.DeepEqual(gotEvents, tt.wantEvents) {
t.Errorf("case %d: got events %#v, want %#v", i, gotEvents, tt.wantEvents)
}
}
}
func TestHarmonyParserStreaming(t *testing.T) {
type step struct {
input string
wantEvents []HarmonyEvent
}
cases := []struct {
desc string
implicitStart bool
steps []step
}{
{
desc: "simple message streamed character by character",
steps: []step{
{
input: "<",
wantEvents: nil,
},
{
input: "|",
wantEvents: nil,
},
{
input: "start|>u",
wantEvents: []HarmonyEvent{HarmonyEventMessageStart{}},
},
{
input: "ser<|mess",
wantEvents: nil,
},
{
input: "age|>Hi",
wantEvents: []HarmonyEvent{
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "user", Channel: "", Recipient: ""}},
HarmonyEventContentEmitted{Content: "Hi"},
},
},
{
input: " there",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: " there"}},
},
{
input: "<|e",
wantEvents: nil,
},
{
input: "nd|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
{
desc: "message with channel streamed",
steps: []step{
{
input: "<|start|>assistant",
wantEvents: []HarmonyEvent{HarmonyEventMessageStart{}},
},
{
input: "<|chan",
wantEvents: nil,
},
{
input: "nel|>analysis",
wantEvents: nil,
},
{
input: "<|message|>",
wantEvents: []HarmonyEvent{HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "analysis", Recipient: ""}}},
},
{
input: "Thinking",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "Thinking"}},
},
{
input: "...",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "..."}},
},
{
input: "<|end|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
{
desc: "message with channel and recipient",
steps: []step{
{
input: "<|start|>assistant<|channel|>commentary to=functions.calc<|message|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "commentary", Recipient: "functions.calc"}},
},
},
{
input: "{\"x\": 5}",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "{\"x\": 5}"}},
},
{
input: "<|end|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
{
desc: "message with channel and recipient (recipient before channel)",
steps: []step{
{
input: "<|start|>assistant to=functions.calc<|channel|>commentary<|message|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "commentary", Recipient: "functions.calc"}},
},
},
{
input: "{\"x\": 5}",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "{\"x\": 5}"}},
},
{
input: "<|end|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
{
desc: "implicit start with channel",
implicitStart: true,
steps: []step{
{
input: "<|channel|>thinking",
wantEvents: []HarmonyEvent{HarmonyEventMessageStart{}},
},
{
input: "<|message|>",
wantEvents: []HarmonyEvent{HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "thinking", Recipient: ""}}},
},
{
input: "Processing request",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "Processing request"}},
},
{
input: "<|end|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
{
desc: "multiple messages streamed",
steps: []step{
{
input: "<|start|>user<|message|>Hello<|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "user", Channel: "", Recipient: ""}},
HarmonyEventContentEmitted{Content: "Hello"},
HarmonyEventMessageEnd{},
},
},
{
input: "<|start|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageStart{}},
},
{
input: "assistant<|message|>",
wantEvents: []HarmonyEvent{HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "assistant", Channel: "", Recipient: ""}}},
},
{
input: "Hi!",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "Hi!"}},
},
{
input: "<|end|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
{
desc: "empty message",
steps: []step{
{
input: "<|start|>system<|message|><|end|>",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "system", Channel: "", Recipient: ""}},
HarmonyEventMessageEnd{},
},
},
},
},
{
desc: "partial tag that looks like end but isn't",
steps: []step{
{
input: "<|start|>user<|message|>test<|e",
wantEvents: []HarmonyEvent{
HarmonyEventMessageStart{},
HarmonyEventHeaderComplete{Header: HarmonyHeader{Role: "user", Channel: "", Recipient: ""}},
HarmonyEventContentEmitted{Content: "test"},
},
},
{
input: "xample|>more",
wantEvents: []HarmonyEvent{HarmonyEventContentEmitted{Content: "<|example|>more"}},
},
{
input: "<|end|>",
wantEvents: []HarmonyEvent{HarmonyEventMessageEnd{}},
},
},
},
}
for _, tc := range cases {
t.Run(tc.desc, func(t *testing.T) {
parser := HarmonyParser{
MessageStartTag: "<|start|>",
MessageEndTag: "<|end|>",
HeaderEndTag: "<|message|>",
}
if tc.implicitStart {
parser.AddImplicitStart()
}
for i, step := range tc.steps {
gotEvents := parser.AddContent(step.input)
if !reflect.DeepEqual(gotEvents, step.wantEvents) {
t.Errorf("step %d: input %q: got events %#v, want %#v", i, step.input, gotEvents, step.wantEvents)
}
}
})
}
}
// TestFunctionConvertToValidChars tests only FunctionNameMap.convertToValidChars(),
// which doesn't handle any saving (and therefore no dupe handling)
func TestFunctionConvertToValidChars(t *testing.T) {
tests := []struct {
name string
in string
want string
}{
{name: "replace spaces with underscores", in: "get weather", want: "get_weather"},
{name: "replace hyphens with underscores", in: "get-weather", want: "get_weather"},
{name: "replace periods with underscores", in: "get.weather", want: "get_weather"},
{name: "disallow non-word characters", in: "get weather!", want: "get_weather"},
{name: "strip out invalid non-alphanumeric unicode characters", in: "a🫠bc", want: "abc"},
{name: "names that only contain invalid characters", in: "🫠", want: "unnamed"},
{name: "leading number", in: "123", want: "_123"},
{name: "$ allowed", in: "$", want: "$"},
// show that we allow weird unicode letter characters, though we might want
// to convert them to their closest ASCII equivalents in the future
{name: "allow weird unicode letter characters", in: "𝓸𝓵𝓵𝓪𝓶𝓪", want: "𝓸𝓵𝓵𝓪𝓶𝓪"},
// names that look like words but are invalid (i.e., not ID_Start/ID_Continue)
{name: "disallow non-word characters that look like words", in: "ⓞⓛⓛⓐⓜⓐ123", want: "_123"},
}
for i, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
parser := NewFunctionNameMap()
got := parser.convertToValidChars(tt.in)
if got != tt.want {
t.Errorf("case %d: got %q, want %q", i, got, tt.want)
}
})
}
}
func TestFunctionConvertAndAdd(t *testing.T) {
// make a fresh map for each test, but within a test use the same map so we can test for dupe handling
tests := []struct {
name string
in []string
want []string
}{
{name: "basic dupe handling", in: []string{"get weather", "get weather"}, want: []string{"get_weather", "get_weather_2"}},
{name: "dupes from different user-specified names", in: []string{"get weather", "get_weather", "get-weather"}, want: []string{"get_weather", "get_weather_2", "get_weather_3"}},
{name: "non dupes after dupes", in: []string{"get weather", "get_weather", "get-weather", "something-different"}, want: []string{"get_weather", "get_weather_2", "get_weather_3", "something_different"}},
{name: "multiple sets of dupes", in: []string{"a", "a", "b", "a", "a", "b", "a"}, want: []string{"a", "a_2", "b", "a_3", "a_4", "b_2", "a_5"}},
{name: "built-in functions should not be renamed", in: []string{"browser.open", "python", "not.a.built-in.function", "browser.not_a_real_built_in"}, want: []string{"browser.open", "python", "not_a_built_in_function", "browser_not_a_real_built_in"}},
}
for i, tt := range tests {
parser := NewFunctionNameMap()
t.Run(tt.name, func(t *testing.T) {
for j, in := range tt.in {
got := parser.ConvertAndAdd(in)
want := tt.want[j]
if got != want {
t.Errorf("case %d: got %q, want %q", i, got, want)
}
// check that the maps are correct
if parser.userToHarmony[in] != want {
t.Errorf("case %d: userToHarmony[%q] = %q, want %q", i, in, parser.userToHarmony[in], want)
}
if parser.harmonyToUser[want] != in {
t.Errorf("case %d: harmonyToUser[%q] = %q, want %q", i, want, parser.harmonyToUser[want], in)
}
}
})
}
}
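The behavior these two tests pin down (separator replacement, a fallback name, and numeric suffixes for duplicates) can be sketched as below. This is a minimal illustration, not the implementation under test: `nameMap`, `sanitize`, and `convertAndAdd` are hypothetical names, `unicode.IsLetter`/`unicode.IsDigit` only approximate the ID_Start/ID_Continue check the test comments mention, and the built-in exemption (`browser.open`, `python`) is left out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// nameMap is a hypothetical stand-in for the FunctionNameMap under test.
type nameMap struct {
	userToSafe map[string]string
	safeToUser map[string]string
}

func newNameMap() *nameMap {
	return &nameMap{
		userToSafe: make(map[string]string),
		safeToUser: make(map[string]string),
	}
}

// sanitize maps spaces, hyphens, and periods to underscores and drops every
// other rune that is not a letter, a digit, '_' or '$'.
func sanitize(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch {
		case r == ' ' || r == '-' || r == '.':
			b.WriteRune('_')
		case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_' || r == '$':
			b.WriteRune(r)
		}
	}
	s := b.String()
	if s == "" {
		return "unnamed"
	}
	// A name may not start with a digit, so prefix an underscore.
	if first, _ := utf8.DecodeRuneInString(s); unicode.IsDigit(first) {
		return "_" + s
	}
	return s
}

// convertAndAdd sanitizes the name, appends _2, _3, ... until the result is
// unused, and records the mapping in both directions.
func (m *nameMap) convertAndAdd(user string) string {
	base := sanitize(user)
	candidate := base
	for n := 2; ; n++ {
		if _, taken := m.safeToUser[candidate]; !taken {
			break
		}
		candidate = fmt.Sprintf("%s_%d", base, n)
	}
	m.userToSafe[user] = candidate
	m.safeToUser[candidate] = user
	return candidate
}
```

Probing the reverse map for each candidate keeps the suffix counter implicit, so interleaved duplicates of different names (the "multiple sets of dupes" case) need no per-name bookkeeping.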

@@ -2,10 +2,16 @@
This directory contains integration tests to exercise Ollama end-to-end to verify behavior.
By default, these tests are disabled so `go test ./...` will exercise only unit tests. To run integration tests you must pass the integration tag. `go test -tags=integration ./...`
By default, these tests are disabled so `go test ./...` will exercise only unit tests. To run integration tests you must pass the integration tag: `go test -tags=integration ./...`. Some tests require additional tags so that testing can be scoped to keep the duration reasonable. For example, testing a broad set of models requires `-tags=integration,models` and a longer timeout (~60m or more, depending on the speed of your GPU). To view the current set of tag combinations, use `find integration -type f | xargs grep "go:build"`
The integration tests have 2 modes of operating.
1. By default, they will start the server on a random port, run the tests, and then shutdown the server.
2. If `OLLAMA_TEST_EXISTING` is set to a non-empty string, the tests will run against an existing running server, which can be remote
2. If `OLLAMA_TEST_EXISTING` is set to a non-empty string, the tests will run against an existing running server, which can be remote based on your `OLLAMA_HOST` environment variable
> [!IMPORTANT]
> Before running the tests locally without the "test existing" setting, compile ollama from the top of the source tree `go build .` in addition to GPU support with cmake if applicable on your platform. The integration tests expect to find an ollama binary at the top of the tree.
Many tests use a default small model suitable to run on many systems. You can override this default model by setting `OLLAMA_TEST_DEFAULT_MODEL`
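As a rough illustration of how the environment variables above combine, the sketch below resolves them into a single target description. It is an assumption-laden example, not the test suite's actual helper: `testTarget`, `resolveTarget`, and the placeholder model name are hypothetical, and `127.0.0.1:0` merely stands in for "a random port".

```go
package main

// testTarget describes where the integration tests would send requests.
type testTarget struct {
	UseExisting bool   // true: reuse a running server; false: start one for the test
	Host        string // address to contact, or to bind when starting a server
	Model       string // model used by tests that do not name one
}

const (
	// placeholderModel is a made-up name; the real default is defined by the test suite.
	placeholderModel = "example-small-model"
	// fallbackHost is Ollama's usual local address, used when OLLAMA_HOST is unset.
	fallbackHost = "127.0.0.1:11434"
)

// resolveTarget reads the variables through getenv (pass os.Getenv in real use).
func resolveTarget(getenv func(string) string) testTarget {
	// Port 0 asks the OS for any free port, approximating "a random port".
	t := testTarget{Host: "127.0.0.1:0", Model: placeholderModel}
	if m := getenv("OLLAMA_TEST_DEFAULT_MODEL"); m != "" {
		t.Model = m
	}
	// Any non-empty value switches to the existing-server mode.
	if getenv("OLLAMA_TEST_EXISTING") != "" {
		t.UseExisting = true
		t.Host = fallbackHost
		if h := getenv("OLLAMA_HOST"); h != "" {
			t.Host = h
		}
	}
	return t
}
```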

@@ -22,13 +22,12 @@ func TestAPIGenerate(t *testing.T) {
// Set up the test data
req := api.GenerateRequest{
Model: smol,
Prompt: "why is the sky blue? be brief",
Prompt: blueSkyPrompt,
Options: map[string]interface{}{
"temperature": 0,
"seed": 123,
},
}
anyResp := []string{"rayleigh", "scattering"}
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
@@ -120,14 +119,14 @@ func TestAPIGenerate(t *testing.T) {
// Verify the response contains the expected data
response := buf.String()
atLeastOne := false
for _, resp := range anyResp {
for _, resp := range blueSkyExpected {
if strings.Contains(strings.ToLower(response), resp) {
atLeastOne = true
break
}
}
if !atLeastOne {
t.Errorf("none of %v found in %s", anyResp, response)
t.Errorf("none of %v found in %s", blueSkyExpected, response)
}
case <-ctx.Done():
t.Error("outer test context done while waiting for generate")
@@ -181,7 +180,7 @@ func TestAPIChat(t *testing.T) {
Messages: []api.Message{
{
Role: "user",
Content: "why is the sky blue? be brief",
Content: blueSkyPrompt,
},
},
Options: map[string]interface{}{
@@ -189,7 +188,6 @@ func TestAPIChat(t *testing.T) {
"seed": 123,
},
}
anyResp := []string{"rayleigh", "scattering"}
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
@@ -279,14 +277,14 @@ func TestAPIChat(t *testing.T) {
// Verify the response contains the expected data
response := buf.String()
atLeastOne := false
for _, resp := range anyResp {
for _, resp := range blueSkyExpected {
if strings.Contains(strings.ToLower(response), resp) {
atLeastOne = true
break
}
}
if !atLeastOne {
t.Errorf("none of %v found in %s", anyResp, response)
t.Errorf("none of %v found in %s", blueSkyExpected, response)
}
case <-ctx.Done():
t.Error("outer test context done while waiting for chat")
@@ -390,7 +388,7 @@ func TestAPIEmbeddings(t *testing.T) {
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
req := api.EmbeddingRequest{
Model: "orca-mini",
Model: libraryEmbedModels[0],
Prompt: "why is the sky blue?",
Options: map[string]interface{}{
"temperature": 0,
@@ -410,3 +408,99 @@ func TestAPIEmbeddings(t *testing.T) {
t.Errorf("zero length embedding response")
}
}
func TestAPIToolCalling(t *testing.T) {
initialTimeout := 60 * time.Second
streamTimeout := 30 * time.Second
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
modelName := "qwen3:0.6b"
if err := PullIfMissing(ctx, client, modelName); err != nil {
t.Fatalf("pull failed %s", err)
}
tools := []api.Tool{
{
Type: "function",
Function: api.ToolFunction{
Name: "get_weather",
Description: "Get the current weather in a given location",
Parameters: api.ToolFunctionParameters{
Type: "object",
Required: []string{"location"},
Properties: map[string]api.ToolProperty{
"location": {
Type: api.PropertyType{"string"},
Description: "The city and state, e.g. San Francisco, CA",
},
},
},
},
},
}
req := api.ChatRequest{
Model: modelName,
Messages: []api.Message{
{
Role: "user",
Content: "Call get_weather with location set to San Francisco.",
},
},
Tools: tools,
Options: map[string]any{
"temperature": 0,
},
}
stallTimer := time.NewTimer(initialTimeout)
var gotToolCall bool
var lastToolCall api.ToolCall
fn := func(response api.ChatResponse) error {
if len(response.Message.ToolCalls) > 0 {
gotToolCall = true
lastToolCall = response.Message.ToolCalls[len(response.Message.ToolCalls)-1]
}
if !stallTimer.Reset(streamTimeout) {
return fmt.Errorf("stall was detected while streaming response, aborting")
}
return nil
}
stream := true
req.Stream = &stream
done := make(chan int)
var genErr error
go func() {
genErr = client.Chat(ctx, &req, fn)
done <- 0
}()
select {
case <-stallTimer.C:
t.Errorf("tool-calling chat never started. Timed out after: %s", initialTimeout.String())
case <-done:
if genErr != nil {
t.Fatalf("chat failed: %v", genErr)
}
if !gotToolCall {
t.Fatalf("expected at least one tool call, got none")
}
if lastToolCall.Function.Name != "get_weather" {
t.Errorf("unexpected tool called: got %q want %q", lastToolCall.Function.Name, "get_weather")
}
if _, ok := lastToolCall.Function.Arguments["location"]; !ok {
t.Errorf("expected tool arguments to include 'location', got: %s", lastToolCall.Function.Arguments.String())
}
case <-ctx.Done():
t.Error("outer test context done while waiting for tool-calling chat")
}
}

@@ -11,23 +11,27 @@ import (
"time"
"github.com/ollama/ollama/api"
"github.com/stretchr/testify/require"
)
func TestBlueSky(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
// Set up the test data
req := api.GenerateRequest{
Model: smol,
Prompt: "why is the sky blue?",
req := api.ChatRequest{
Model: smol,
Messages: []api.Message{
{
Role: "user",
Content: blueSkyPrompt,
},
},
Stream: &stream,
Options: map[string]any{
"temperature": 0,
"seed": 123,
},
}
GenerateTestHelper(ctx, t, req, []string{"rayleigh", "scattering"})
ChatTestHelper(ctx, t, req, blueSkyExpected)
}
func TestUnicode(t *testing.T) {
@@ -35,10 +39,15 @@ func TestUnicode(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
defer cancel()
// Set up the test data
req := api.GenerateRequest{
req := api.ChatRequest{
// DeepSeek has a Unicode tokenizer regex, making it a unicode torture test
Model: "deepseek-coder-v2:16b-lite-instruct-q2_K",
Prompt: "天空为什么是蓝色的?",
Model: "deepseek-coder-v2:16b-lite-instruct-q2_K", // TODO is there an ollama-engine model we can switch to and keep the coverage?
Messages: []api.Message{
{
Role: "user",
Content: "天空为什么是蓝色的?", // Why is the sky blue?
},
},
Stream: &stream,
Options: map[string]any{
"temperature": 0,
@@ -50,17 +59,39 @@ func TestUnicode(t *testing.T) {
}
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
require.NoError(t, PullIfMissing(ctx, client, req.Model))
DoGenerate(ctx, t, client, req, []string{"散射", "频率"}, 120*time.Second, 120*time.Second)
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatal(err)
}
slog.Info("loading", "model", req.Model)
err := client.Generate(ctx, &api.GenerateRequest{Model: req.Model}, func(response api.GenerateResponse) error { return nil })
if err != nil {
t.Fatalf("failed to load model %s: %s", req.Model, err)
}
defer func() {
// best effort unload once we're done with the model
client.Generate(ctx, &api.GenerateRequest{Model: req.Model, KeepAlive: &api.Duration{Duration: 0}}, func(rsp api.GenerateResponse) error { return nil })
}()
skipIfNotGPULoaded(ctx, t, client, req.Model, 100)
DoChat(ctx, t, client, req, []string{
"散射", // scattering
"频率", // frequency
}, 120*time.Second, 120*time.Second)
}
func TestExtendedUnicodeOutput(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
// Set up the test data
req := api.GenerateRequest{
Model: "gemma2:2b",
Prompt: "Output some smily face emoji",
req := api.ChatRequest{
Model: "gemma2:2b",
Messages: []api.Message{
{
Role: "user",
Content: "Output some smily face emoji",
},
},
Stream: &stream,
Options: map[string]any{
"temperature": 0,
@@ -69,8 +100,10 @@ func TestExtendedUnicodeOutput(t *testing.T) {
}
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
require.NoError(t, PullIfMissing(ctx, client, req.Model))
DoGenerate(ctx, t, client, req, []string{"😀", "😊", "😁", "😂", "😄", "😃"}, 120*time.Second, 120*time.Second)
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatal(err)
}
DoChat(ctx, t, client, req, []string{"😀", "😊", "😁", "😂", "😄", "😃"}, 120*time.Second, 120*time.Second)
}
func TestUnicodeModelDir(t *testing.T) {
@@ -84,7 +117,9 @@ func TestUnicodeModelDir(t *testing.T) {
}
modelDir, err := os.MkdirTemp("", "ollama_埃")
require.NoError(t, err)
if err != nil {
t.Fatal(err)
}
defer os.RemoveAll(modelDir)
slog.Info("unicode", "OLLAMA_MODELS", modelDir)
@@ -93,14 +128,19 @@ func TestUnicodeModelDir(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
req := api.GenerateRequest{
Model: smol,
Prompt: "why is the sky blue?",
req := api.ChatRequest{
Model: smol,
Messages: []api.Message{
{
Role: "user",
Content: blueSkyPrompt,
},
},
Stream: &stream,
Options: map[string]any{
"temperature": 0,
"seed": 123,
},
}
GenerateTestHelper(ctx, t, req, []string{"rayleigh", "scattering"})
ChatTestHelper(ctx, t, req, blueSkyExpected)
}

@@ -4,257 +4,185 @@ package integration
import (
"context"
"fmt"
"log/slog"
"math"
"math/rand"
"os"
"strconv"
"sync"
"testing"
"time"
"github.com/stretchr/testify/require"
"github.com/ollama/ollama/api"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
)
func TestMultiModelConcurrency(t *testing.T) {
var (
req = [2]api.GenerateRequest{
{
Model: "llama3.2:1b",
Prompt: "why is the ocean blue?",
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
}, {
Model: "tinydolphin",
Prompt: "what is the origin of the us thanksgiving holiday?",
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
},
}
resp = [2][]string{
{"sunlight"},
{"england", "english", "massachusetts", "pilgrims", "british", "festival"},
}
)
var wg sync.WaitGroup
wg.Add(len(req))
ctx, cancel := context.WithTimeout(context.Background(), time.Second*240)
defer cancel()
// Send multiple requests in parallel (concurrently) to a single model and ensure responses are expected
func TestConcurrentChat(t *testing.T) {
// Assumes all requests have the same model
req, resp := ChatRequests()
numParallel := int(envconfig.NumParallel() + 1)
iterLimit := 3
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
for i := 0; i < len(req); i++ {
require.NoError(t, PullIfMissing(ctx, client, req[i].Model))
}
for i := 0; i < len(req); i++ {
go func(i int) {
defer wg.Done()
// Note: CPU based inference can crawl so don't give up too quickly
DoGenerate(ctx, t, client, req[i], resp[i], 90*time.Second, 30*time.Second)
}(i)
}
wg.Wait()
}
func TestIntegrationConcurrentPredict(t *testing.T) {
req, resp := GenerateRequests()
reqLimit := len(req)
iterLimit := 5
if s := os.Getenv("OLLAMA_MAX_VRAM"); s != "" {
maxVram, err := strconv.ParseUint(s, 10, 64)
require.NoError(t, err)
// Don't hammer on small VRAM cards...
if maxVram < 4*format.GibiByte {
reqLimit = min(reqLimit, 2)
iterLimit = 2
}
}
ctx, cancel := context.WithTimeout(context.Background(), 9*time.Minute)
softTimeout, hardTimeout := getTimeouts(t)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
// Get the server running (if applicable) and warm the model up with a single initial request
DoGenerate(ctx, t, client, req[0], resp[0], 60*time.Second, 10*time.Second)
slog.Info("loading", "model", req[0].Model)
err := client.Generate(ctx,
&api.GenerateRequest{Model: req[0].Model, KeepAlive: &api.Duration{Duration: 10 * time.Second}},
func(response api.GenerateResponse) error { return nil },
)
if err != nil {
t.Fatalf("failed to load model %s: %s", req[0].Model, err)
}
var wg sync.WaitGroup
wg.Add(reqLimit)
for i := 0; i < reqLimit; i++ {
r := rand.New(rand.NewSource(0))
wg.Add(numParallel)
for i := range numParallel {
go func(i int) {
defer wg.Done()
for j := 0; j < iterLimit; j++ {
slog.Info("Starting", "req", i, "iter", j)
if time.Now().Sub(started) > softTimeout {
slog.Info("exceeded soft timeout, winding down test")
return
}
k := r.Int() % len(req)
slog.Info("Starting", "thread", i, "iter", j)
// On slower GPUs it can take a while to process the concurrent requests
// so we allow a much longer initial timeout
DoGenerate(ctx, t, client, req[i], resp[i], 120*time.Second, 20*time.Second)
DoChat(ctx, t, client, req[k], resp[k], 120*time.Second, 20*time.Second)
}
}(i)
}
wg.Wait()
}
// Stress the system if we know how much VRAM it has, and attempt to load more models than will fit
// Stress the scheduler and attempt to load more models than will fit to cause thrashing
// This test will always load at least 2 models even on CPU based systems
func TestMultiModelStress(t *testing.T) {
s := os.Getenv("OLLAMA_MAX_VRAM") // TODO - discover actual VRAM
s := os.Getenv("OLLAMA_MAX_VRAM")
if s == "" {
t.Skip("OLLAMA_MAX_VRAM not specified, can't pick the right models for the stress test")
s = "0"
}
maxVram, err := strconv.ParseUint(s, 10, 64)
if err != nil {
t.Fatal(err)
}
if maxVram < 2*format.GibiByte {
t.Skip("VRAM less than 2G, skipping model stress tests")
// All models compatible with ollama-engine
smallModels := []string{
"llama3.2:1b",
"qwen3:0.6b",
"gemma2:2b",
"deepseek-r1:1.5b", // qwen2 arch
"gemma3:270m",
}
mediumModels := []string{
"llama3.2:3b", // ~3.4G
"qwen3:8b", // ~6.6G
"gpt-oss:20b", // ~15G
"deepseek-r1:7b", // ~5.6G
"gemma3:4b", // ~5.8G
"gemma2:9b", // ~8.1G
}
type model struct {
name string
size uint64 // Approximate amount of VRAM they typically use when fully loaded in VRAM
}
smallModels := []model{
{
name: "llama3.2:1b",
size: 2876 * format.MebiByte,
},
{
name: "phi",
size: 2616 * format.MebiByte,
},
{
name: "gemma:2b",
size: 2364 * format.MebiByte,
},
{
name: "stable-code:3b",
size: 2608 * format.MebiByte,
},
{
name: "starcoder2:3b",
size: 2166 * format.MebiByte,
},
}
mediumModels := []model{
{
name: "llama2",
size: 5118 * format.MebiByte,
},
{
name: "mistral",
size: 4620 * format.MebiByte,
},
{
name: "orca-mini:7b",
size: 5118 * format.MebiByte,
},
{
name: "dolphin-mistral",
size: 4620 * format.MebiByte,
},
{
name: "gemma:7b",
size: 5000 * format.MebiByte,
},
{
name: "codellama:7b",
size: 5118 * format.MebiByte,
},
}
// These seem to be too slow to be useful...
// largeModels := []model{
// {
// name: "llama2:13b",
// size: 7400 * format.MebiByte,
// },
// {
// name: "codellama:13b",
// size: 7400 * format.MebiByte,
// },
// {
// name: "orca-mini:13b",
// size: 7400 * format.MebiByte,
// },
// {
// name: "gemma:7b",
// size: 5000 * format.MebiByte,
// },
// {
// name: "starcoder2:15b",
// size: 9100 * format.MebiByte,
// },
// }
var chosenModels []model
var chosenModels []string
switch {
case maxVram < 10000*format.MebiByte:
slog.Info("selecting small models")
chosenModels = smallModels
// case maxVram < 30000*format.MebiByte:
default:
slog.Info("selecting medium models")
chosenModels = mediumModels
// default:
// slog.Info("selecting large models")
// chosenModels = largeModels
}
req, resp := GenerateRequests()
for i := range req {
if i > len(chosenModels) {
break
}
req[i].Model = chosenModels[i].name
}
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute) // TODO baseline -- 10m too short
softTimeout, hardTimeout := getTimeouts(t)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
initialTimeout := 120 * time.Second
streamTimeout := 20 * time.Second
// Make sure all the models are pulled before we get started
for _, r := range req {
require.NoError(t, PullIfMissing(ctx, client, r.Model))
for _, model := range chosenModels {
if err := PullIfMissing(ctx, client, model); err != nil {
t.Fatal(err)
}
}
var wg sync.WaitGroup
consumed := uint64(256 * format.MebiByte) // Assume some baseline usage
for i := 0; i < len(req); i++ {
// Always get at least 2 models, but don't overshoot VRAM too much or we'll take too long
if i > 1 && consumed > maxVram {
slog.Info("achieved target vram exhaustion", "count", i, "vram", format.HumanBytes2(maxVram), "models", format.HumanBytes2(consumed))
break
// Determine how many models we can load in parallel before we exceed VRAM
// The intent is to go 1 over what can fit so we force the scheduler to thrash
targetLoadCount := 0
slog.Info("Loading models to find how many can fit in VRAM before overflowing")
chooseModels:
for i, model := range chosenModels {
req := &api.GenerateRequest{Model: model}
slog.Info("loading", "model", model)
err = client.Generate(ctx, req, func(response api.GenerateResponse) error { return nil })
if err != nil {
t.Fatalf("failed to load model %s: %s", model, err)
}
consumed += chosenModels[i].size
slog.Info("target vram", "count", i, "vram", format.HumanBytes2(maxVram), "models", format.HumanBytes2(consumed))
targetLoadCount++
if i > 0 {
models, err := client.ListRunning(ctx)
if err != nil {
t.Fatalf("failed to list running models: %s", err)
}
if len(models.Models) < targetLoadCount {
loaded := []string{}
for _, m := range models.Models {
loaded = append(loaded, m.Name)
}
slog.Info("found model load capacity", "target", targetLoadCount, "current", loaded, "chosen", chosenModels[:targetLoadCount])
break
}
// Effectively limit model count to 2 on CPU only systems to avoid thrashing and timeouts
for _, m := range models.Models {
if m.SizeVRAM == 0 {
slog.Info("model running on CPU", "name", m.Name, "target", targetLoadCount, "chosen", chosenModels[:targetLoadCount])
initialTimeout = 240 * time.Second
streamTimeout = 30 * time.Second
break chooseModels
}
}
}
}
if targetLoadCount == len(chosenModels) {
// TODO consider retrying the medium models
slog.Warn("all models being used without exceeding VRAM, set OLLAMA_MAX_VRAM so test can pick larger models")
}
r := rand.New(rand.NewSource(0))
var wg sync.WaitGroup
for i := range targetLoadCount {
wg.Add(1)
go func(i int) {
defer wg.Done()
reqs, resps := ChatRequests()
for j := 0; j < 3; j++ {
slog.Info("Starting", "req", i, "iter", j, "model", req[i].Model)
DoGenerate(ctx, t, client, req[i], resp[i], 120*time.Second, 5*time.Second)
if time.Now().Sub(started) > softTimeout {
slog.Info("exceeded soft timeout, winding down test")
return
}
k := r.Int() % len(reqs)
reqs[k].Model = chosenModels[i]
slog.Info("Starting", "model", reqs[k].Model, "iteration", j, "request", reqs[k].Messages[0].Content)
DoChat(ctx, t, client, reqs[k], resps[k], initialTimeout, streamTimeout)
}
}(i)
}
go func() {
for {
time.Sleep(2 * time.Second)
time.Sleep(10 * time.Second)
select {
case <-ctx.Done():
return
@@ -265,7 +193,21 @@ func TestMultiModelStress(t *testing.T) {
continue
}
for _, m := range models.Models {
slog.Info("loaded model snapshot", "model", m)
var procStr string
switch {
case m.SizeVRAM == 0:
procStr = "100% CPU"
case m.SizeVRAM == m.Size:
procStr = "100% GPU"
case m.SizeVRAM > m.Size || m.Size == 0:
procStr = "Unknown"
default:
sizeCPU := m.Size - m.SizeVRAM
cpuPercent := math.Round(float64(sizeCPU) / float64(m.Size) * 100)
procStr = fmt.Sprintf("%d%%/%d%%", int(cpuPercent), int(100-cpuPercent))
}
slog.Info("loaded model snapshot", "model", m.Name, "CPU/GPU", procStr, "expires", format.HumanTime(m.ExpiresAt, "Never"))
}
}
}
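The CPU/GPU split reported in the snapshot log above is a pure calculation, so it can be pulled out and checked in isolation. The following is a minimal sketch that mirrors the switch in the hunk; `processorSplit` is a hypothetical helper name, and the `int64` parameter types are assumed to match the `Size` and `SizeVRAM` fields.

```go
package main

import (
	"fmt"
	"math"
)

// processorSplit describes how a loaded model is divided between CPU and GPU
// memory, given its total size and the portion resident in VRAM.
func processorSplit(size, sizeVRAM int64) string {
	switch {
	case sizeVRAM == 0:
		return "100% CPU"
	case sizeVRAM == size:
		return "100% GPU"
	case sizeVRAM > size || size == 0:
		// Inconsistent numbers: avoid a negative or divide-by-zero percentage.
		return "Unknown"
	default:
		sizeCPU := size - sizeVRAM
		cpuPercent := math.Round(float64(sizeCPU) / float64(size) * 100)
		// Reported as CPU share followed by GPU share.
		return fmt.Sprintf("%d%%/%d%%", int(cpuPercent), int(100-cpuPercent))
	}
}
```

For example, a model of 100 units with 75 in VRAM yields `25%/75%`, read as 25% CPU and 75% GPU.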

@@ -4,6 +4,8 @@ package integration
import (
"context"
"log/slog"
"sync"
"testing"
"time"
@@ -19,9 +21,14 @@ func TestLongInputContext(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
// Set up the test data
req := api.GenerateRequest{
Model: "llama2",
Prompt: "Oh, dont speak to me of Austria. Perhaps I dont understand things, but Austria never has wished, and does not wish, for war. She is betraying us! Russia alone must save Europe. Our gracious sovereign recognizes his high vocation and will be true to it. That is the one thing I have faith in! Our good and wonderful sovereign has to perform the noblest role on earth, and he is so virtuous and noble that God will not forsake him. He will fulfill his vocation and crush the hydra of revolution, which has become more terrible than ever in the person of this murderer and villain! We alone must avenge the blood of the just one.... Whom, I ask you, can we rely on?... England with her commercial spirit will not and cannot understand the Emperor Alexanders loftiness of soul. She has refused to evacuate Malta. She wanted to find, and still seeks, some secret motive in our actions. What answer did Novosíltsev get? None. The English have not understood and cannot understand the self-abnegation of our Emperor who wants nothing for himself, but only desires the good of mankind. And what have they promised? Nothing! And what little they have promised they will not perform! Prussia has always declared that Buonaparte is invincible, and that all Europe is powerless before him.... And I dont believe a word that Hardenburg says, or Haugwitz either. This famous Prussian neutrality is just a trap. I have faith only in God and the lofty destiny of our adored monarch. He will save Europe! What country is this referring to?",
req := api.ChatRequest{
Model: smol,
Messages: []api.Message{
{
Role: "user",
Content: "Oh, dont speak to me of Austria. Perhaps I dont understand things, but Austria never has wished, and does not wish, for war. She is betraying us! Russia alone must save Europe. Our gracious sovereign recognizes his high vocation and will be true to it. That is the one thing I have faith in! Our good and wonderful sovereign has to perform the noblest role on earth, and he is so virtuous and noble that God will not forsake him. He will fulfill his vocation and crush the hydra of revolution, which has become more terrible than ever in the person of this murderer and villain! We alone must avenge the blood of the just one.... Whom, I ask you, can we rely on?... England with her commercial spirit will not and cannot understand the Emperor Alexanders loftiness of soul. She has refused to evacuate Malta. She wanted to find, and still seeks, some secret motive in our actions. What answer did Novosíltsev get? None. The English have not understood and cannot understand the self-abnegation of our Emperor who wants nothing for himself, but only desires the good of mankind. And what have they promised? Nothing! And what little they have promised they will not perform! Prussia has always declared that Buonaparte is invincible, and that all Europe is powerless before him.... And I dont believe a word that Hardenburg says, or Haugwitz either. This famous Prussian neutrality is just a trap. I have faith only in God and the lofty destiny of our adored monarch. He will save Europe! What country is this referring to?",
},
},
Stream: &stream,
Options: map[string]any{
"temperature": 0,
@@ -34,7 +41,7 @@ func TestLongInputContext(t *testing.T) {
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatalf("PullIfMissing failed: %v", err)
}
DoGenerate(ctx, t, client, req, []string{"russia", "germany", "france", "england", "austria", "prussia"}, 120*time.Second, 10*time.Second)
DoChat(ctx, t, client, req, []string{"russia", "german", "france", "england", "austria", "prussia", "europe", "individuals", "coalition", "conflict"}, 120*time.Second, 10*time.Second)
}
func TestContextExhaustion(t *testing.T) {
@@ -46,9 +53,14 @@ func TestContextExhaustion(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
// Set up the test data
req := api.GenerateRequest{
Model: "llama2",
Prompt: "Write me a story with a ton of emojis?",
req := api.ChatRequest{
Model: smol,
Messages: []api.Message{
{
Role: "user",
Content: "Write me a story in english with a lot of emojis",
},
},
Stream: &stream,
Options: map[string]any{
"temperature": 0,
@@ -61,5 +73,212 @@ func TestContextExhaustion(t *testing.T) {
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatalf("PullIfMissing failed: %v", err)
}
DoGenerate(ctx, t, client, req, []string{"once", "upon", "lived"}, 120*time.Second, 10*time.Second)
DoChat(ctx, t, client, req, []string{"once", "upon", "lived", "sunny", "cloudy", "clear", "water", "time", "travel", "world"}, 120*time.Second, 10*time.Second)
}
// Send multiple generate requests with prior context and ensure the response is coherent and expected
func TestParallelGenerateWithHistory(t *testing.T) {
modelName := "gpt-oss:20b"
req, resp := GenerateRequests()
numParallel := 2
iterLimit := 2
softTimeout, hardTimeout := getTimeouts(t)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
initialTimeout := 120 * time.Second
streamTimeout := 20 * time.Second
// Get the server running (if applicable) and warm the model up with a single initial request
slog.Info("loading", "model", modelName)
err := client.Generate(ctx,
&api.GenerateRequest{Model: modelName, KeepAlive: &api.Duration{Duration: 10 * time.Second}},
func(response api.GenerateResponse) error { return nil },
)
if err != nil {
t.Fatalf("failed to load model %s: %s", modelName, err)
}
gpuPercent := getGPUPercent(ctx, t, client, modelName)
if gpuPercent < 80 {
slog.Warn("Low GPU percentage - increasing timeouts", "percent", gpuPercent)
initialTimeout = 240 * time.Second
streamTimeout = 30 * time.Second
}
var wg sync.WaitGroup
wg.Add(numParallel)
for i := range numParallel {
go func(i int) {
defer wg.Done()
k := i % len(req)
req[k].Model = modelName
for j := 0; j < iterLimit; j++ {
if time.Now().Sub(started) > softTimeout {
slog.Info("exceeded soft timeout, winding down test")
return
}
slog.Info("Starting", "thread", i, "iter", j)
// On slower GPUs it can take a while to process the concurrent requests
// so we allow a much longer initial timeout
c := DoGenerate(ctx, t, client, req[k], resp[k], initialTimeout, streamTimeout)
req[k].Context = c
req[k].Prompt = "tell me more!"
}
}(i)
}
wg.Wait()
}
// Send generate requests with prior context and ensure the response is coherent and expected
func TestGenerateWithHistory(t *testing.T) {
req := api.GenerateRequest{
Model: smol,
Prompt: rainbowPrompt,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"num_ctx": 16384,
},
}
softTimeout, hardTimeout := getTimeouts(t)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
// Get the server running (if applicable) and warm the model up with a single initial request
slog.Info("loading", "model", req.Model)
err := client.Generate(ctx,
&api.GenerateRequest{Model: req.Model, KeepAlive: &api.Duration{Duration: 10 * time.Second}, Options: req.Options},
func(response api.GenerateResponse) error { return nil },
)
if err != nil {
t.Fatalf("failed to load model %s: %s", req.Model, err)
}
req.Context = DoGenerate(ctx, t, client, req, rainbowExpected, 30*time.Second, 20*time.Second)
for i := 0; i < len(rainbowFollowups); i++ {
req.Prompt = rainbowFollowups[i]
if time.Now().Sub(started) > softTimeout {
slog.Info("exceeded soft timeout, winding down test")
return
}
req.Context = DoGenerate(ctx, t, client, req, rainbowExpected, 30*time.Second, 20*time.Second)
}
}
// Send multiple chat requests with prior context and ensure the response is coherent and expected
func TestParallelChatWithHistory(t *testing.T) {
modelName := "gpt-oss:20b"
req, resp := ChatRequests()
numParallel := 2
iterLimit := 2
softTimeout, hardTimeout := getTimeouts(t)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
initialTimeout := 120 * time.Second
streamTimeout := 20 * time.Second
// Get the server running (if applicable) and warm the model up with a single initial empty request
slog.Info("loading", "model", modelName)
err := client.Generate(ctx,
&api.GenerateRequest{Model: modelName, KeepAlive: &api.Duration{Duration: 10 * time.Second}},
func(response api.GenerateResponse) error { return nil },
)
if err != nil {
t.Fatalf("failed to load model %s: %s", modelName, err)
}
gpuPercent := getGPUPercent(ctx, t, client, modelName)
if gpuPercent < 80 {
slog.Warn("Low GPU percentage - increasing timeouts", "percent", gpuPercent)
initialTimeout = 240 * time.Second
streamTimeout = 30 * time.Second
}
var wg sync.WaitGroup
wg.Add(numParallel)
for i := range numParallel {
go func(i int) {
defer wg.Done()
k := i % len(req)
req[k].Model = modelName
for j := 0; j < iterLimit; j++ {
if time.Since(started) > softTimeout {
slog.Info("exceeded soft timeout, winding down test")
return
}
slog.Info("Starting", "thread", i, "iter", j)
// On slower GPUs it can take a while to process the concurrent requests
// so we allow a much longer initial timeout
assistant := DoChat(ctx, t, client, req[k], resp[k], initialTimeout, streamTimeout)
if assistant == nil {
t.Fatalf("didn't get an assistant response for context")
}
req[k].Messages = append(req[k].Messages,
*assistant,
api.Message{Role: "user", Content: "tell me more!"},
)
}
}(i)
}
wg.Wait()
}
// Send chat requests with prior context and ensure the response is coherent and expected
func TestChatWithHistory(t *testing.T) {
req := api.ChatRequest{
Model: smol,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"num_ctx": 16384,
},
Messages: []api.Message{
{
Role: "user",
Content: rainbowPrompt,
},
},
}
softTimeout, hardTimeout := getTimeouts(t)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
// Get the server running (if applicable) and warm the model up with a single initial request
slog.Info("loading", "model", req.Model)
err := client.Generate(ctx,
&api.GenerateRequest{Model: req.Model, KeepAlive: &api.Duration{Duration: 10 * time.Second}, Options: req.Options},
func(response api.GenerateResponse) error { return nil },
)
if err != nil {
t.Fatalf("failed to load model %s: %s", req.Model, err)
}
assistant := DoChat(ctx, t, client, req, rainbowExpected, 30*time.Second, 20*time.Second)
for i := 0; i < len(rainbowFollowups); i++ {
if time.Since(started) > softTimeout {
slog.Info("exceeded soft timeout, winding down test")
return
}
req.Messages = append(req.Messages,
*assistant,
api.Message{Role: "user", Content: rainbowFollowups[i]},
)
assistant = DoChat(ctx, t, client, req, rainbowExpected, 30*time.Second, 20*time.Second)
if assistant == nil {
t.Fatalf("didn't get an assistant response for context")
}
}
}


@@ -8,6 +8,7 @@ import (
"testing"
"time"
"github.com/google/go-cmp/cmp"
"github.com/ollama/ollama/api"
)
@@ -38,14 +39,14 @@ func TestAllMiniLMEmbeddings(t *testing.T) {
defer cleanup()
req := api.EmbeddingRequest{
Model: "all-minilm",
Prompt: "why is the sky blue?",
Model: "all-minilm",
Prompt: "why is the sky blue?",
KeepAlive: &api.Duration{Duration: 10 * time.Second},
}
res, err := embeddingTestHelper(ctx, client, t, req)
if err != nil {
t.Fatalf("error: %v", err)
t.Fatal(err)
}
if len(res.Embedding) != 384 {
@@ -73,9 +74,8 @@ func TestAllMiniLMEmbed(t *testing.T) {
}
res, err := embedTestHelper(ctx, client, t, req)
if err != nil {
t.Fatalf("error: %v", err)
t.Fatal(err)
}
if len(res.Embeddings) != 1 {
@@ -111,9 +111,8 @@ func TestAllMiniLMBatchEmbed(t *testing.T) {
}
res, err := embedTestHelper(ctx, client, t, req)
if err != nil {
t.Fatalf("error: %v", err)
t.Fatal(err)
}
if len(res.Embeddings) != 2 {
@@ -155,93 +154,148 @@ func TestAllMiniLMEmbedTruncate(t *testing.T) {
truncTrue, truncFalse := true, false
type testReq struct {
Name string
Request api.EmbedRequest
want, err := embedTestHelper(ctx, client, t, api.EmbedRequest{
Model: "all-minilm",
Input: "why",
})
if err != nil {
t.Fatal(err)
}
reqs := []testReq{
cases := []struct {
name string
request api.EmbedRequest
check func(*api.EmbedResponse, error)
}{
{
Name: "Target Truncation",
Request: api.EmbedRequest{
name: "target truncation",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why",
},
},
{
Name: "Default Truncate",
Request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Options: map[string]any{"num_ctx": 1},
check: func(got *api.EmbedResponse, err error) {
if err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(want.Embeddings[0], got.Embeddings[0]); diff != "" {
t.Errorf("embedding mismatch (-want +got):\n%s", diff)
}
},
},
{
Name: "Explicit Truncate",
Request: api.EmbedRequest{
name: "default truncate",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Options: map[string]any{"num_ctx": 3},
},
check: func(got *api.EmbedResponse, err error) {
if err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(want.Embeddings[0], got.Embeddings[0]); diff != "" {
t.Errorf("embedding mismatch (-want +got):\n%s", diff)
}
},
},
{
name: "explicit truncate",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Truncate: &truncTrue,
Options: map[string]any{"num_ctx": 3},
},
check: func(got *api.EmbedResponse, err error) {
if err != nil {
t.Fatal(err)
}
if diff := cmp.Diff(want.Embeddings[0], got.Embeddings[0]); diff != "" {
t.Errorf("embedding mismatch (-want +got):\n%s", diff)
}
},
},
{
name: "truncate error",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Truncate: &truncFalse,
Options: map[string]any{"num_ctx": 3},
},
check: func(res *api.EmbedResponse, err error) {
if err == nil || err.Error() != "input exceeds maximum context length" {
t.Fatalf("expected truncation error, got: %v", err)
}
},
},
{
name: "input after truncate error",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Truncate: &truncTrue,
Options: map[string]any{"num_ctx": 1},
},
check: func(res *api.EmbedResponse, err error) {
if err == nil || err.Error() != "input after truncation exceeds maximum context length" {
t.Fatalf("expected truncation error, got: %v", err)
}
},
},
{
name: "input after truncate error with zero num_ctx",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Truncate: &truncTrue,
Options: map[string]any{"num_ctx": 0},
},
check: func(res *api.EmbedResponse, err error) {
if err == nil || err.Error() != "input after truncation exceeds maximum context length" {
t.Fatalf("expected truncation error, got: %v", err)
}
},
},
{
name: "boundary truncation",
request: api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue? Why is the sky blue? hi there my",
Options: map[string]any{"num_ctx": 16},
},
check: func(res *api.EmbedResponse, err error) {
if err != nil {
t.Fatal(err)
}
},
},
}
res := make(map[string]*api.EmbedResponse)
for _, req := range reqs {
response, err := embedTestHelper(ctx, client, t, req.Request)
if err != nil {
t.Fatalf("error: %v", err)
}
res[req.Name] = response
}
if res["Target Truncation"].Embeddings[0][0] != res["Default Truncate"].Embeddings[0][0] {
t.Fatal("expected default request to truncate correctly")
}
if res["Default Truncate"].Embeddings[0][0] != res["Explicit Truncate"].Embeddings[0][0] {
t.Fatal("expected default request and truncate true request to be the same")
}
// check that truncate set to false returns an error if context length is exceeded
_, err := embedTestHelper(ctx, client, t, api.EmbedRequest{
Model: "all-minilm",
Input: "why is the sky blue?",
Truncate: &truncFalse,
Options: map[string]any{"num_ctx": 1},
})
if err == nil {
t.Fatal("expected error, got nil")
for _, req := range cases {
t.Run(req.name, func(t *testing.T) {
req.check(embedTestHelper(ctx, client, t, req.request))
})
}
}
func embeddingTestHelper(ctx context.Context, client *api.Client, t *testing.T, req api.EmbeddingRequest) (*api.EmbeddingResponse, error) {
t.Helper()
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatalf("failed to pull model %s: %v", req.Model, err)
t.Fatal(err)
}
response, err := client.Embeddings(ctx, &req)
if err != nil {
return nil, err
}
return response, nil
return client.Embeddings(ctx, &req)
}
func embedTestHelper(ctx context.Context, client *api.Client, t *testing.T, req api.EmbedRequest) (*api.EmbedResponse, error) {
t.Helper()
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatalf("failed to pull model %s: %v", req.Model, err)
t.Fatal(err)
}
response, err := client.Embed(ctx, &req)
if err != nil {
return nil, err
}
return response, nil
return client.Embed(ctx, &req)
}
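
The table-driven layout in TestAllMiniLMEmbedTruncate pairs each request with a check closure that receives the response and the error. The sketch below illustrates that pattern in isolation, under stated assumptions: `fakeEmbed`, `checkWantErr`, and the word-counting rule are hypothetical stand-ins for illustration and are not part of the ollama API. It also shows checking for a nil error before comparing its message, so a missing error is reported as a failure instead of causing a nil dereference.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fakeEmbed is a hypothetical stand-in for a server call. It counts
// whitespace-separated words and fails when the input exceeds limit
// and truncation is disabled; otherwise it truncates to limit words.
func fakeEmbed(input string, limit int, truncate bool) (int, error) {
	words := strings.Fields(input)
	if len(words) > limit {
		if !truncate {
			return 0, errors.New("input exceeds maximum context length")
		}
		words = words[:limit]
	}
	return len(words), nil
}

// checkWantErr returns an empty string when err is non-nil and carries
// exactly the expected text, and a failure message otherwise. The nil
// check comes first so err.Error() is never called on a nil error.
func checkWantErr(err error, want string) string {
	if err == nil || err.Error() != want {
		return fmt.Sprintf("expected %q, got: %v", want, err)
	}
	return ""
}

func main() {
	cases := []struct {
		name     string
		input    string
		limit    int
		truncate bool
		check    func(n int, err error) string
	}{
		{
			name: "explicit truncate", input: "why is the sky blue?", limit: 3, truncate: true,
			check: func(n int, err error) string {
				if err != nil || n != 3 {
					return fmt.Sprintf("expected 3 words, got %d (err: %v)", n, err)
				}
				return ""
			},
		},
		{
			name: "truncate error", input: "why is the sky blue?", limit: 3, truncate: false,
			check: func(_ int, err error) string {
				return checkWantErr(err, "input exceeds maximum context length")
			},
		},
	}
	for _, tc := range cases {
		// The two results of fakeEmbed are passed straight into check.
		if msg := tc.check(fakeEmbed(tc.input, tc.limit, tc.truncate)); msg != "" {
			fmt.Println("FAIL", tc.name, msg)
			continue
		}
		fmt.Println("ok", tc.name)
	}
}
```

Keeping the expectation next to the request means a new case only needs one new table entry, at the cost of some repetition between similar check closures.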


@@ -4,7 +4,9 @@ package integration
import (
"context"
"fmt"
"log/slog"
"os"
"testing"
"time"
@@ -13,13 +15,14 @@ import (
// First run of this scenario on a target system will take a long time to download
// ~1.5TB of models. Set a sufficiently large -timeout for your network speed
func TestLibraryModelsGenerate(t *testing.T) {
func TestLibraryModelsChat(t *testing.T) {
softTimeout, hardTimeout := getTimeouts(t)
slog.Info("Setting timeouts", "soft", softTimeout, "hard", hardTimeout)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
targetArch := os.Getenv("OLLAMA_TEST_ARCHITECTURE")
chatModels := libraryChatModels
for _, model := range chatModels {
@@ -30,28 +33,43 @@ func TestLibraryModelsGenerate(t *testing.T) {
if err := PullIfMissing(ctx, client, model); err != nil {
t.Fatalf("pull failed %s", err)
}
req := api.GenerateRequest{
Model: model,
Prompt: "why is the sky blue?",
if targetArch != "" {
resp, err := client.Show(ctx, &api.ShowRequest{Name: model})
if err != nil {
t.Fatalf("unable to show model: %s", err)
}
arch := resp.ModelInfo["general.architecture"].(string)
if arch != targetArch {
t.Skip(fmt.Sprintf("Skipping %s architecture %s != %s", model, arch, targetArch))
}
}
req := api.ChatRequest{
Model: model,
Messages: []api.Message{
{
Role: "user",
Content: blueSkyPrompt,
},
},
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]interface{}{
"temperature": 0.1,
"seed": 123,
},
}
anyResp := []string{"rayleigh", "scatter", "atmosphere", "nitrogen", "oxygen", "wavelength"}
anyResp := blueSkyExpected
// Special cases
if model == "duckdb-nsql" {
anyResp = []string{"select", "from"}
} else if model == "granite3-guardian" || model == "shieldgemma" || model == "llama-guard3" || model == "bespoke-minicheck" {
anyResp = []string{"yes", "no", "safe", "unsafe"}
} else if model == "openthinker" || model == "nexusraven" {
} else if model == "openthinker" {
anyResp = []string{"plugin", "im_sep", "components", "function call"}
} else if model == "starcoder" || model == "starcoder2" || model == "magicoder" || model == "deepseek-coder" {
req.Prompt = "def fibonacci():"
req.Messages[0].Content = "def fibonacci():"
anyResp = []string{"f(n)", "sequence", "n-1", "main()", "__main__", "while"}
}
DoGenerate(ctx, t, client, req, anyResp, 120*time.Second, 30*time.Second)
DoChat(ctx, t, client, req, anyResp, 120*time.Second, 30*time.Second)
})
}
}
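
The library and model tests pass when the response contains any one of a list of expected keywords. A minimal sketch of that matching rule follows; `containsAny` is a hypothetical helper name used here for illustration (the test code inlines this loop), and it assumes the expected values are already lowercase.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether the lowercased response contains at least
// one of the expected substrings. Expected values are assumed lowercase.
func containsAny(response string, expected []string) bool {
	lower := strings.ToLower(response)
	for _, want := range expected {
		if strings.Contains(lower, want) {
			return true
		}
	}
	return false
}

func main() {
	expected := []string{"rayleigh", "scatter", "wavelength"}
	fmt.Println(containsAny("Rayleigh scattering dominates.", expected))
	fmt.Println(containsAny("Because it is.", expected))
}
```

Matching on substrings such as "scatter" rather than whole words lets one entry cover "scatter", "scattering", and "scattered".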


@@ -9,7 +9,6 @@ import (
"time"
"github.com/ollama/ollama/api"
"github.com/stretchr/testify/require"
)
func TestVisionModels(t *testing.T) {
@@ -32,18 +31,25 @@ func TestVisionModels(t *testing.T) {
for _, v := range testCases {
t.Run(v.model, func(t *testing.T) {
image, err := base64.StdEncoding.DecodeString(imageEncoding)
require.NoError(t, err)
req := api.GenerateRequest{
Model: v.model,
Prompt: "what does the text in this image say?",
if err != nil {
t.Fatal(err)
}
req := api.ChatRequest{
Model: v.model,
Messages: []api.Message{
{
Role: "user",
Content: "what does the text in this image say?",
Images: []api.ImageData{
image,
},
},
},
Stream: &stream,
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
Images: []api.ImageData{
image,
},
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
@@ -52,9 +58,18 @@ func TestVisionModels(t *testing.T) {
// Note: sometimes it returns "the ollamas" sometimes "the ollams"
resp := "the ollam"
defer cleanup()
require.NoError(t, PullIfMissing(ctx, client, req.Model))
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatal(err)
}
// Preload to skip if we're less than 80% on GPU to avoid extremely slow tests
err = client.Generate(ctx, &api.GenerateRequest{Model: req.Model}, func(response api.GenerateResponse) error { return nil })
if err != nil {
t.Fatalf("failed to load model %s: %s", req.Model, err)
}
skipIfNotGPULoaded(ctx, t, client, req.Model, 80)
// llava models on CPU can be quite slow to start
DoGenerate(ctx, t, client, req, []string{resp}, 240*time.Second, 30*time.Second)
DoChat(ctx, t, client, req, []string{resp}, 240*time.Second, 30*time.Second)
})
}
}
@@ -62,7 +77,9 @@ func TestVisionModels(t *testing.T) {
func TestIntegrationSplitBatch(t *testing.T) {
skipUnderMinVRAM(t, 6)
image, err := base64.StdEncoding.DecodeString(imageEncoding)
require.NoError(t, err)
if err != nil {
t.Fatal(err)
}
req := api.GenerateRequest{
Model: "gemma3:4b",
// Fill up a chunk of the batch so the image will partially spill over into the next one
@@ -84,7 +101,9 @@ func TestIntegrationSplitBatch(t *testing.T) {
defer cancel()
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
require.NoError(t, PullIfMissing(ctx, client, req.Model))
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatal(err)
}
// llava models on CPU can be quite slow to start
DoGenerate(ctx, t, client, req, []string{resp}, 120*time.Second, 30*time.Second)
}


@@ -1,47 +0,0 @@
//go:build integration
package integration
import (
"context"
"testing"
"time"
"github.com/ollama/ollama/api"
)
// TODO - this would ideally be in the llm package, but that would require some refactoring of interfaces in the server
// package to avoid circular dependencies
var (
stream = false
req = [2]api.GenerateRequest{
{
Model: smol,
Prompt: "why is the ocean blue?",
Stream: &stream,
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
}, {
Model: smol,
Prompt: "what is the origin of the us thanksgiving holiday?",
Stream: &stream,
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
},
}
resp = [2][]string{
{"sunlight", "scattering", "interact"},
{"england", "english", "massachusetts", "pilgrims"},
}
)
func TestIntegrationSimple(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), time.Second*120)
defer cancel()
GenerateTestHelper(ctx, t, req[0], resp[0])
}


@@ -13,12 +13,12 @@ import (
"testing"
"time"
"github.com/stretchr/testify/require"
"github.com/ollama/ollama/api"
)
func TestMaxQueue(t *testing.T) {
t.Skip("this test needs to be re-evaluated to use a proper embedding model")
if os.Getenv("OLLAMA_TEST_EXISTING") != "" {
t.Skip("Max Queue test requires spawning a local server so we can adjust the queue size")
return
@@ -45,7 +45,9 @@ func TestMaxQueue(t *testing.T) {
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
require.NoError(t, PullIfMissing(ctx, client, req.Model))
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatal(err)
}
// Context for the worker threads so we can shut them down
// embedCtx, embedCancel := context.WithCancel(ctx)
@@ -89,7 +91,9 @@ func TestMaxQueue(t *testing.T) {
switch {
case genErr == nil:
successCount++
require.Greater(t, len(resp.Embedding), 5) // somewhat arbitrary, but sufficient to be reasonable
if len(resp.Embedding) <= 5 { // somewhat arbitrary, but sufficient to be reasonable
t.Fatalf("embeddings shorter than expected: %d", len(resp.Embedding))
}
case errors.Is(genErr, context.Canceled):
canceledCount++
case strings.Contains(genErr.Error(), "busy"):
@@ -97,7 +101,9 @@ func TestMaxQueue(t *testing.T) {
case strings.Contains(genErr.Error(), "connection reset by peer"):
resetByPeerCount++
default:
require.NoError(t, genErr, "%d request failed", i)
if genErr != nil {
t.Fatalf("%d request failed: %v", i, genErr)
}
}
slog.Info("embed finished", "id", i)
@@ -108,8 +114,13 @@ func TestMaxQueue(t *testing.T) {
embedwg.Wait()
slog.Info("embeds completed", "success", successCount, "busy", busyCount, "reset", resetByPeerCount, "canceled", canceledCount)
require.Equal(t, resetByPeerCount, 0, "Connections reset by peer, have you updated your fd and socket limits?")
require.True(t, busyCount > 0, "no requests hit busy error but some should have")
require.True(t, canceledCount == 0, "no requests should have been canceled due to timeout")
if resetByPeerCount != 0 {
t.Fatalf("Connections reset by peer, have you updated your fd and socket limits? %d", resetByPeerCount)
}
if busyCount == 0 {
t.Fatalf("no requests hit busy error but some should have")
}
if canceledCount > 0 {
t.Fatalf("no requests should have been canceled due to timeout %d", canceledCount)
}
}


@@ -19,7 +19,7 @@ import (
"github.com/ollama/ollama/format"
)
func TestModelsGenerate(t *testing.T) {
func TestModelsChat(t *testing.T) {
softTimeout, hardTimeout := getTimeouts(t)
slog.Info("Setting timeouts", "soft", softTimeout, "hard", hardTimeout)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
@@ -65,17 +65,41 @@ func TestModelsGenerate(t *testing.T) {
}
}
}
initialTimeout := 120 * time.Second
streamTimeout := 30 * time.Second
slog.Info("loading", "model", model)
err := client.Generate(ctx,
&api.GenerateRequest{Model: model, KeepAlive: &api.Duration{Duration: 10 * time.Second}},
func(response api.GenerateResponse) error { return nil },
)
if err != nil {
t.Fatalf("failed to load model %s: %s", model, err)
}
gpuPercent := getGPUPercent(ctx, t, client, model)
if gpuPercent < 80 {
slog.Warn("Low GPU percentage - increasing timeouts", "percent", gpuPercent)
initialTimeout = 240 * time.Second
streamTimeout = 40 * time.Second
}
// TODO - fiddle with context size
req := api.GenerateRequest{
Model: model,
Prompt: "why is the sky blue?",
req := api.ChatRequest{
Model: model,
Messages: []api.Message{
{
Role: "user",
Content: blueSkyPrompt,
},
},
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]interface{}{
"temperature": 0,
"seed": 123,
},
}
anyResp := []string{"rayleigh", "scattering", "atmosphere", "nitrogen", "oxygen"}
DoGenerate(ctx, t, client, req, anyResp, 120*time.Second, 30*time.Second)
DoChat(ctx, t, client, req, blueSkyExpected, initialTimeout, streamTimeout)
// best effort unload once we're done with the model
client.Generate(ctx, &api.GenerateRequest{Model: req.Model, KeepAlive: &api.Duration{Duration: 0}}, func(rsp api.GenerateResponse) error { return nil })
})
}
}
@@ -129,8 +153,9 @@ func TestModelsEmbed(t *testing.T) {
}
}
req := api.EmbeddingRequest{
Model: model,
Prompt: "why is the sky blue?",
Model: model,
Prompt: "why is the sky blue?",
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]interface{}{
"temperature": 0,
"seed": 123,
@@ -140,6 +165,10 @@ func TestModelsEmbed(t *testing.T) {
if err != nil {
t.Fatalf("embeddings call failed %s", err)
}
defer func() {
// best effort unload once we're done with the model
client.Generate(ctx, &api.GenerateRequest{Model: req.Model, KeepAlive: &api.Duration{Duration: 0}}, func(rsp api.GenerateResponse) error { return nil })
}()
if len(resp.Embedding) == 0 {
t.Errorf("zero length embedding response")
}


@@ -40,6 +40,18 @@ var (
// cat int.log | grep MODEL_PERF_HEADER | head -1| cut -f2- -d: > perf.csv
// cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
func TestModelsPerf(t *testing.T) {
if s := os.Getenv("OLLAMA_NEW_ENGINE"); s != "" {
doModelPerfTest(t, ollamaEngineChatModels)
} else {
doModelPerfTest(t, append(ollamaEngineChatModels, llamaRunnerChatModels...))
}
}
func TestLibraryModelsPerf(t *testing.T) {
doModelPerfTest(t, libraryChatModels)
}
func doModelPerfTest(t *testing.T, chatModels []string) {
softTimeout, hardTimeout := getTimeouts(t)
slog.Info("Setting timeouts", "soft", softTimeout, "hard", hardTimeout)
ctx, cancel := context.WithTimeout(context.Background(), hardTimeout)
@@ -65,14 +77,12 @@ func TestModelsPerf(t *testing.T) {
}
longPrompt := "summarize the following: " + string(data)
var chatModels []string
if s := os.Getenv("OLLAMA_NEW_ENGINE"); s != "" {
chatModels = ollamaEngineChatModels
} else {
chatModels = append(ollamaEngineChatModels, llamaRunnerChatModels...)
}
targetArch := os.Getenv("OLLAMA_TEST_ARCHITECTURE")
for _, model := range chatModels {
if !strings.Contains(model, ":") {
model = model + ":latest"
}
t.Run(model, func(t *testing.T) {
if time.Since(started) > softTimeout {
t.Skip("skipping remaining tests to avoid excessive runtime")
@@ -88,6 +98,9 @@ func TestModelsPerf(t *testing.T) {
}
arch := resp.ModelInfo["general.architecture"].(string)
maxContext = int(resp.ModelInfo[fmt.Sprintf("%s.context_length", arch)].(float64))
if targetArch != "" && arch != targetArch {
t.Skipf("Skipping %s architecture %s != %s", model, arch, targetArch)
}
if maxVram > 0 {
resp, err := client.List(ctx)
@@ -148,11 +161,12 @@ func TestModelsPerf(t *testing.T) {
}
testCases := []struct {
name string
prompt string
anyResp []string
}{
{"why is the sky blue?", []string{"rayleigh", "scattering", "atmosphere", "nitrogen", "oxygen"}},
{maxPrompt, []string{"shakespeare", "oppression", "sorrows", "gutenberg", "child", "license", "sonnet", "melancholy"}},
{"blue_sky", blueSkyPrompt, blueSkyExpected},
{"max", maxPrompt, []string{"shakespeare", "oppression", "sorrows", "gutenberg", "child", "license", "sonnet", "melancholy", "love", "sorrow", "beauty"}},
}
var gpuPercent int
for _, tc := range testCases {
@@ -160,9 +174,14 @@ func TestModelsPerf(t *testing.T) {
slog.Info("skipping long prompt", "model", model, "num_ctx", numCtx, "gpu_percent", gpuPercent)
continue
}
req := api.GenerateRequest{
Model: model,
Prompt: tc.prompt,
req := api.ChatRequest{
Model: model,
Messages: []api.Message{
{
Role: "user",
Content: tc.prompt,
},
},
KeepAlive: &api.Duration{Duration: 20 * time.Second}, // long enough to ensure a ps returns
Options: map[string]interface{}{
"temperature": 0,
@@ -171,7 +190,7 @@ func TestModelsPerf(t *testing.T) {
},
}
atLeastOne := false
var resp api.GenerateResponse
var resp api.ChatResponse
stream := false
req.Stream = &stream
@@ -185,7 +204,7 @@ func TestModelsPerf(t *testing.T) {
)
defer cancel()
err = client.Generate(genCtx, &req, func(rsp api.GenerateResponse) error {
err = client.Chat(genCtx, &req, func(rsp api.ChatResponse) error {
resp = rsp
return nil
})
@@ -201,13 +220,13 @@ func TestModelsPerf(t *testing.T) {
}
loaded = true
for _, expResp := range tc.anyResp {
if strings.Contains(strings.ToLower(resp.Response), expResp) {
if strings.Contains(strings.ToLower(resp.Message.Content), expResp) {
atLeastOne = true
break
}
}
if !atLeastOne {
t.Fatalf("response didn't contain expected values: ctx:%d expected:%v response:%s ", numCtx, tc.anyResp, resp.Response)
t.Fatalf("response didn't contain expected values: ctx:%d expected:%v response:%s ", numCtx, tc.anyResp, resp.Message.Content)
}
models, err := client.ListRunning(ctx)
if err != nil {
@@ -241,24 +260,20 @@ func TestModelsPerf(t *testing.T) {
}
}
}
fmt.Fprintf(os.Stderr, "MODEL_PERF_HEADER:%s,%s,%s,%s,%s,%s,%s\n",
"MODEL",
"CONTEXT",
"GPU PERCENT",
"PROMPT COUNT",
"LOAD TIME",
"PROMPT EVAL TPS",
"EVAL TPS",
)
fmt.Fprintf(os.Stderr, "MODEL_PERF_DATA:%s,%d,%d,%d,%0.2f,%0.2f,%0.2f\n",
model,
numCtx,
gpuPercent,
resp.PromptEvalCount,
float64(resp.LoadDuration)/1000000000.0,
float64(resp.PromptEvalCount)/(float64(resp.PromptEvalDuration)/1000000000.0),
float64(resp.EvalCount)/(float64(resp.EvalDuration)/1000000000.0),
)
prefillTimePerToken := float64(resp.PromptEvalDuration.Nanoseconds()) / float64(resp.PromptEvalCount)
prefillTokensPerSec := float64(resp.PromptEvalCount) / (float64(resp.PromptEvalDuration.Nanoseconds()) + 1e-12) * 1e9
fmt.Fprintf(os.Stderr, "BenchmarkModel/name=%s-%s/%d/step=%s %d %.2f ns/token %.2f token/sec\n",
model, tc.name, numCtx, "prefill", resp.PromptEvalCount, prefillTimePerToken, prefillTokensPerSec)
evalTimePerToken := float64(resp.EvalDuration.Nanoseconds()) / float64(resp.EvalCount)
evalTokensPerSec := float64(resp.EvalCount) / (float64(resp.EvalDuration.Nanoseconds()) + 1e-12) * 1e9
fmt.Fprintf(os.Stderr, "BenchmarkModel/name=%s-%s/%d/step=%s %d %.2f ns/token %.2f token/sec\n",
model, tc.name, numCtx, "generate", resp.EvalCount, evalTimePerToken, evalTokensPerSec)
fmt.Fprintf(os.Stderr, "BenchmarkModel/name=%s-%s/%d 1 %d ns/request\n",
model, tc.name, numCtx, resp.TotalDuration.Nanoseconds())
fmt.Fprintf(os.Stderr, "BenchmarkModel/name=%s-%s/%d/step=%s 1 %d ns/request\n",
model, tc.name, numCtx, "load", resp.LoadDuration.Nanoseconds())
}
}
})


@@ -74,9 +74,14 @@ func TestQuantization(t *testing.T) {
}
stream := true
genReq := api.GenerateRequest{
Model: newName,
Prompt: "why is the sky blue?",
chatReq := api.ChatRequest{
Model: newName,
Messages: []api.Message{
{
Role: "user",
Content: blueSkyPrompt,
},
},
KeepAlive: &api.Duration{Duration: 3 * time.Second},
Options: map[string]any{
"seed": 42,
@@ -88,14 +93,13 @@ func TestQuantization(t *testing.T) {
// Some smaller quantizations can cause models to have poor quality
// or get stuck in repetition loops, so we stop as soon as we have any matches
anyResp := []string{"rayleigh", "scattering", "day", "sun", "moon", "color", "nitrogen", "oxygen"}
reqCtx, reqCancel := context.WithCancel(ctx)
atLeastOne := false
var buf bytes.Buffer
genfn := func(response api.GenerateResponse) error {
buf.Write([]byte(response.Response))
chatfn := func(response api.ChatResponse) error {
buf.Write([]byte(response.Message.Content))
fullResp := strings.ToLower(buf.String())
for _, resp := range anyResp {
for _, resp := range blueSkyExpected {
if strings.Contains(fullResp, resp) {
atLeastOne = true
t.Log(fullResp)
@@ -109,14 +113,14 @@ func TestQuantization(t *testing.T) {
done := make(chan int)
var genErr error
go func() {
genErr = client.Generate(reqCtx, &genReq, genfn)
genErr = client.Chat(reqCtx, &chatReq, chatfn)
done <- 0
}()
select {
case <-done:
if genErr != nil && !atLeastOne {
t.Fatalf("failed with %s request prompt %s ", genReq.Model, genReq.Prompt)
t.Fatalf("failed with %s request prompt %s ", chatReq.Model, chatReq.Messages[0].Content)
}
case <-ctx.Done():
t.Error("outer test context done while waiting for generate")


File diff suppressed because one or more lines are too long


@@ -9,11 +9,13 @@ import (
"fmt"
"io"
"log/slog"
"math"
"math/rand"
"net"
"net/http"
"net/url"
"os"
"os/exec"
"path/filepath"
"runtime"
"strconv"
@@ -23,13 +25,12 @@ import (
"time"
"github.com/ollama/ollama/api"
"github.com/ollama/ollama/app/lifecycle"
"github.com/ollama/ollama/format"
"github.com/stretchr/testify/require"
)
const (
smol = "llama3.2:1b"
var (
smol = "llama3.2:1b"
stream = false
)
var (
@@ -37,6 +38,8 @@ var (
// Note: add newer models at the top of the list to test them first
ollamaEngineChatModels = []string{
"qwen3-coder:30b",
"gpt-oss:20b",
"gemma3n:e2b",
"mistral-small3.2:latest",
"deepseek-r1:1.5b",
@@ -44,6 +47,7 @@ var (
"qwen2.5-coder:latest",
"qwen2.5vl:3b",
"qwen3:0.6b", // dense
"qwen3:1.7b", // dense
"qwen3:30b", // MOE
"gemma3:1b",
"llama3.1:latest",
@@ -126,6 +130,7 @@ var (
"gemma3n",
"glm4",
"goliath",
"gpt-oss:20b",
"granite-code",
"granite3-dense",
"granite3-guardian",
@@ -253,10 +258,30 @@ var (
"snowflake-arctic-embed",
"snowflake-arctic-embed2",
}
blueSkyPrompt = "why is the sky blue? Be brief but factual in your reply"
blueSkyExpected = []string{"rayleigh", "scatter", "atmosphere", "nitrogen", "oxygen", "wavelength", "interact"}
rainbowPrompt = "how do rainbows form? Be brief but factual in your reply"
rainbowFollowups = []string{
"Explain the physics involved in them. Be brief in your reply",
"Explain the chemistry involved in them. Be brief in your reply",
"What are common myths related to them? Be brief in your reply",
"Can they form if there is no rain? Be brief in your reply",
"Can they form if there are no clouds? Be brief in your reply",
"Do they happen on other planets? Be brief in your reply",
}
rainbowExpected = []string{"water", "droplet", "mist", "glow", "refract", "reflect", "scatter", "particles", "wave", "color", "spectrum", "raindrop", "atmosphere", "frequency", "shower", "sky", "shimmer", "light", "storm", "sunny", "sunburst", "phenomenon", "mars", "venus", "jupiter"}
)
func Init() {
lifecycle.InitLogging()
func init() {
logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelDebug}))
slog.SetDefault(logger)
custom := os.Getenv("OLLAMA_TEST_DEFAULT_MODEL")
if custom != "" {
slog.Info("setting default test model to " + custom)
smol = custom
}
}
func FindPort() string {
@@ -312,6 +337,7 @@ func GetTestEndpoint() (*api.Client, string) {
var serverMutex sync.Mutex
var serverReady bool
var serverLogFile string
func startServer(t *testing.T, ctx context.Context, ollamaHost string) error {
// Make sure the server has been built
@@ -338,8 +364,9 @@ func startServer(t *testing.T, ctx context.Context, ollamaHost string) error {
t.Setenv("OLLAMA_HOST", ollamaHost)
}
logDir := t.TempDir()
slog.Info("starting server", "url", ollamaHost)
done, err := lifecycle.SpawnServer(ctx, "../ollama")
done, err := SpawnServer(ctx, "../ollama", logDir)
if err != nil {
return fmt.Errorf("failed to start server: %w", err)
}
@@ -362,6 +389,36 @@ func startServer(t *testing.T, ctx context.Context, ollamaHost string) error {
return nil
}
func SpawnServer(ctx context.Context, command, logDir string) (chan int, error) {
done := make(chan int)
fp, err := os.CreateTemp(logDir, "ollama-server-*.log")
if err != nil {
return nil, fmt.Errorf("failed to create log file: %w", err)
}
serverLogFile = fp.Name()
cmd := exec.CommandContext(ctx, command, "serve")
cmd.Stderr = fp
cmd.Stdout = fp
go func() {
slog.Info("starting server...")
if err := cmd.Run(); err != nil {
// "signal: killed" expected
if !strings.Contains(err.Error(), "signal") {
slog.Info("failed to run server", "error", err)
}
}
var code int
if cmd.ProcessState != nil {
code = cmd.ProcessState.ExitCode()
}
slog.Info("server exited")
done <- code
}()
return done, nil
}
func PullIfMissing(ctx context.Context, client *api.Client, modelName string) error {
slog.Info("checking status of model", "model", modelName)
showReq := &api.ShowRequest{Name: modelName}
@@ -422,58 +479,74 @@ func InitServerConnection(ctx context.Context, t *testing.T) (*api.Client, strin
client, testEndpoint := GetTestEndpoint()
if os.Getenv("OLLAMA_TEST_EXISTING") == "" {
serverProcMutex.Lock()
fp, err := os.CreateTemp("", "ollama-server-*.log")
if err != nil {
t.Fatalf("failed to generate log file: %s", err)
if err := startServer(t, ctx, testEndpoint); err != nil {
t.Fatal(err)
}
lifecycle.ServerLogFile = fp.Name()
fp.Close()
require.NoError(t, startServer(t, ctx, testEndpoint))
}
// Make sure server is online and healthy before returning
listCtx, cancel := context.WithDeadlineCause(
ctx,
time.Now().Add(120*time.Second),
fmt.Errorf("list models took too long"),
)
defer cancel()
models, err := client.ListRunning(listCtx)
if err != nil {
t.Fatal(err)
}
if len(models.Models) > 0 {
names := make([]string, len(models.Models))
for i, m := range models.Models {
names[i] = m.Name
}
slog.Info("currently loaded", "models", names)
}
return client, testEndpoint, func() {
if os.Getenv("OLLAMA_TEST_EXISTING") == "" {
defer serverProcMutex.Unlock()
if t.Failed() {
fp, err := os.Open(lifecycle.ServerLogFile)
fp, err := os.Open(serverLogFile)
if err != nil {
slog.Error("failed to open server log", "logfile", lifecycle.ServerLogFile, "error", err)
slog.Error("failed to open server log", "logfile", serverLogFile, "error", err)
return
}
defer fp.Close()
data, err := io.ReadAll(fp)
if err != nil {
slog.Error("failed to read server log", "logfile", lifecycle.ServerLogFile, "error", err)
slog.Error("failed to read server log", "logfile", serverLogFile, "error", err)
return
}
slog.Warn("SERVER LOG FOLLOWS")
os.Stderr.Write(data)
slog.Warn("END OF SERVER")
}
err := os.Remove(lifecycle.ServerLogFile)
if err != nil && !os.IsNotExist(err) {
slog.Warn("failed to cleanup", "logfile", lifecycle.ServerLogFile, "error", err)
}
}
}
}
func GenerateTestHelper(ctx context.Context, t *testing.T, genReq api.GenerateRequest, anyResp []string) {
func ChatTestHelper(ctx context.Context, t *testing.T, req api.ChatRequest, anyResp []string) {
client, _, cleanup := InitServerConnection(ctx, t)
defer cleanup()
require.NoError(t, PullIfMissing(ctx, client, genReq.Model))
DoGenerate(ctx, t, client, genReq, anyResp, 30*time.Second, 10*time.Second)
if err := PullIfMissing(ctx, client, req.Model); err != nil {
t.Fatal(err)
}
DoChat(ctx, t, client, req, anyResp, 30*time.Second, 10*time.Second)
}
func DoGenerate(ctx context.Context, t *testing.T, client *api.Client, genReq api.GenerateRequest, anyResp []string, initialTimeout, streamTimeout time.Duration) {
func DoGenerate(ctx context.Context, t *testing.T, client *api.Client, genReq api.GenerateRequest, anyResp []string, initialTimeout, streamTimeout time.Duration) []int {
stallTimer := time.NewTimer(initialTimeout)
var buf bytes.Buffer
var context []int
fn := func(response api.GenerateResponse) error {
// fmt.Print(".")
buf.Write([]byte(response.Response))
if !stallTimer.Reset(streamTimeout) {
return errors.New("stall was detected while streaming response, aborting")
}
if len(response.Context) > 0 {
context = response.Context
}
return nil
}
@@ -486,6 +559,22 @@ func DoGenerate(ctx context.Context, t *testing.T, client *api.Client, genReq ap
done <- 0
}()
var response string
verify := func() {
// Verify the response contains the expected data
response = buf.String()
atLeastOne := false
for _, resp := range anyResp {
if strings.Contains(strings.ToLower(response), resp) {
atLeastOne = true
break
}
}
if !atLeastOne {
t.Fatalf("%s: none of %v found in %s", genReq.Model, anyResp, response)
}
}
select {
case <-stallTimer.C:
if buf.Len() == 0 {
@@ -496,23 +585,21 @@ func DoGenerate(ctx context.Context, t *testing.T, client *api.Client, genReq ap
case <-done:
if genErr != nil && strings.Contains(genErr.Error(), "model requires more system memory") {
slog.Warn("model is too large for the target test system", "model", genReq.Model, "error", genErr)
return
return context
}
require.NoError(t, genErr, "failed with %s request prompt %s ", genReq.Model, genReq.Prompt)
// Verify the response contains the expected data
response := buf.String()
atLeastOne := false
for _, resp := range anyResp {
if strings.Contains(strings.ToLower(response), resp) {
atLeastOne = true
break
}
if genErr != nil {
t.Fatalf("%s failed with %s request prompt %s", genErr, genReq.Model, genReq.Prompt)
}
require.True(t, atLeastOne, "%s: none of %v found in %s", genReq.Model, anyResp, response)
verify()
slog.Info("test pass", "model", genReq.Model, "prompt", genReq.Prompt, "contains", anyResp, "response", response)
case <-ctx.Done():
t.Error("outer test context done while waiting for generate")
// On slow systems, we might timeout before some models finish rambling, so check what we have so far to see
// if it's considered a pass - the stallTimer will detect hangs, but we want to consider slow systems a pass
// if they are still generating valid responses
slog.Warn("outer test context done while waiting for generate")
verify()
}
return context
}
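
Review note: the `verify` closure above passes a response when any one of the expected substrings appears in the lower-cased output. A minimal standalone sketch of that check follows; the name `containsAny` is hypothetical and not part of the test helpers, and it assumes the expected strings are already lower case, as they are in the request tables.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether the lower-cased response contains at least one
// of the expected substrings. Hypothetical helper for illustration only; the
// expected values are assumed to be lower case already.
func containsAny(response string, expected []string) bool {
	lower := strings.ToLower(response)
	for _, want := range expected {
		if strings.Contains(lower, want) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsAny("Rayleigh scattering of Sunlight", []string{"sunlight", "wavelength"}))
	fmt.Println(containsAny("No idea", []string{"nitrogen", "oxygen"}))
}
```

Because one match is enough, widening an expected list (as this change does for several prompts) only makes a test more lenient.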
// Generate a set of requests
@@ -521,65 +608,132 @@ func GenerateRequests() ([]api.GenerateRequest, [][]string) {
return []api.GenerateRequest{
{
Model: smol,
Prompt: "why is the ocean blue?",
Prompt: "why is the ocean blue? Be brief but factual in your reply",
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
}, {
Model: smol,
Prompt: "why is the color of dirt brown?",
Prompt: "why is the color of dirt brown? Be brief but factual in your reply",
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
}, {
Model: smol,
Prompt: "what is the origin of the us thanksgiving holiday?",
Prompt: rainbowPrompt,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
}, {
Model: smol,
Prompt: "what is the origin of independence day?",
Prompt: "what is the origin of independence day? Be brief but factual in your reply",
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
}, {
Model: smol,
Prompt: "what is the composition of air?",
Prompt: "what is the composition of air? Be brief but factual in your reply",
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{
"seed": 42,
"temperature": 0.0,
},
},
},
[][]string{
{"sunlight"},
{"soil", "organic", "earth", "black", "tan"},
{"england", "english", "massachusetts", "pilgrims", "british"},
{"sunlight", "scatter", "interact", "color", "surface", "depth", "red", "orange", "yellow", "absorb", "wavelength", "water", "molecule"},
{"soil", "organic", "earth", "black", "tan", "chemical", "processes", "pigment", "particle", "iron oxide", "rust", "air", "water", "wet", "mixture", "mixing", "mineral", "element", "decomposed", "matter", "wavelength"},
rainbowExpected,
{"fourth", "july", "declaration", "independence"},
{"nitrogen", "oxygen", "carbon", "dioxide"},
{"nitrogen", "oxygen", "carbon", "dioxide", "water", "vapor", "fluid", "particles", "gas"},
}
}
func DoChat(ctx context.Context, t *testing.T, client *api.Client, req api.ChatRequest, anyResp []string, initialTimeout, streamTimeout time.Duration) *api.Message {
stallTimer := time.NewTimer(initialTimeout)
var buf bytes.Buffer
role := "assistant"
fn := func(response api.ChatResponse) error {
// fmt.Print(".")
role = response.Message.Role
buf.Write([]byte(response.Message.Content))
if !stallTimer.Reset(streamTimeout) {
return errors.New("stall was detected while streaming response, aborting")
}
return nil
}
stream := true
req.Stream = &stream
done := make(chan int)
var genErr error
go func() {
genErr = client.Chat(ctx, &req, fn)
done <- 0
}()
var response string
verify := func() {
// Verify the response contains the expected data
response = buf.String()
atLeastOne := false
for _, resp := range anyResp {
if strings.Contains(strings.ToLower(response), resp) {
atLeastOne = true
break
}
}
if !atLeastOne {
t.Fatalf("%s: none of %v found in \"%s\" -- request was:%v", req.Model, anyResp, response, req.Messages)
}
}
select {
case <-stallTimer.C:
if buf.Len() == 0 {
t.Errorf("chat never started. Timed out after: %s", initialTimeout.String())
} else {
t.Errorf("chat stalled. Response so far: %s", buf.String())
}
case <-done:
if genErr != nil && strings.Contains(genErr.Error(), "model requires more system memory") {
slog.Warn("model is too large for the target test system", "model", req.Model, "error", genErr)
return nil
}
if genErr != nil {
t.Fatalf("%s failed with %s request prompt %v", genErr, req.Model, req.Messages)
}
verify()
slog.Info("test pass", "model", req.Model, "messages", req.Messages, "contains", anyResp, "response", response)
case <-ctx.Done():
// On slow systems, we might timeout before some models finish rambling, so check what we have so far to see
// if it's considered a pass - the stallTimer will detect hangs, but we want to consider slow systems a pass
// if they are still generating valid responses
slog.Warn("outer test context done while waiting for chat")
verify()
}
return &api.Message{Role: role, Content: buf.String()}
}
func ChatRequests() ([]api.ChatRequest, [][]string) {
genReqs, results := GenerateRequests()
reqs := make([]api.ChatRequest, len(genReqs))
// think := api.ThinkValue{Value: "low"}
for i := range reqs {
reqs[i].Model = genReqs[i].Model
reqs[i].Stream = genReqs[i].Stream
reqs[i].KeepAlive = genReqs[i].KeepAlive
// reqs[i].Think = &think
reqs[i].Messages = []api.Message{
{
Role: "user",
Content: genReqs[i].Prompt,
},
}
}
return reqs, results
}
func skipUnderMinVRAM(t *testing.T, gb uint64) {
// TODO use info API in the future
if s := os.Getenv("OLLAMA_MAX_VRAM"); s != "" {
maxVram, err := strconv.ParseUint(s, 10, 64)
require.NoError(t, err)
if err != nil {
t.Fatal(err)
}
// Don't hammer on small VRAM cards...
if maxVram < gb*format.GibiByte {
t.Skip("skipping with small VRAM to avoid timeouts")
@@ -587,6 +741,50 @@ func skipUnderMinVRAM(t *testing.T, gb uint64) {
}
}
// Skip if the target model isn't X% GPU loaded to avoid excessive runtime
func skipIfNotGPULoaded(ctx context.Context, t *testing.T, client *api.Client, model string, minPercent int) {
gpuPercent := getGPUPercent(ctx, t, client, model)
if gpuPercent < minPercent {
t.Skipf("test requires minimum %d%% GPU load, but model %s only has %d%%", minPercent, model, gpuPercent)
}
}
func getGPUPercent(ctx context.Context, t *testing.T, client *api.Client, model string) int {
models, err := client.ListRunning(ctx)
if err != nil {
t.Fatalf("failed to list running models: %s", err)
}
loaded := []string{}
for _, m := range models.Models {
loaded = append(loaded, m.Name)
if strings.Contains(model, ":") {
if m.Name != model {
continue
}
} else if strings.Contains(m.Name, ":") {
if !strings.HasPrefix(m.Name, model+":") {
continue
}
}
gpuPercent := 0
switch {
case m.SizeVRAM == 0:
gpuPercent = 0
case m.SizeVRAM == m.Size:
gpuPercent = 100
case m.SizeVRAM > m.Size || m.Size == 0:
t.Logf("unexpected size detected: %d", m.SizeVRAM)
default:
sizeCPU := m.Size - m.SizeVRAM
cpuPercent := math.Round(float64(sizeCPU) / float64(m.Size) * 100)
gpuPercent = int(100 - cpuPercent)
}
return gpuPercent
}
t.Fatalf("model %s not loaded - actually loaded: %v", model, loaded)
return 0
}
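
Review note: the GPU share is derived from two sizes, the total model size and the part resident in VRAM. The sketch below isolates that arithmetic under stated assumptions: `gpuPercent` is a hypothetical free function, the `int64` parameter types are assumed, the conventional factor of 100 is used for the percentage, and a boolean is returned in place of the `t.Logf` call for inconsistent sizes.

```go
package main

import (
	"fmt"
	"math"
)

// gpuPercent returns the whole-number percentage of a loaded model that is
// resident in VRAM. The second result is false when the sizes are
// inconsistent (hypothetical signature, for illustration only).
func gpuPercent(size, sizeVRAM int64) (int, bool) {
	switch {
	case size <= 0 || sizeVRAM < 0 || sizeVRAM > size:
		return 0, false
	case sizeVRAM == 0:
		return 0, true
	case sizeVRAM == size:
		return 100, true
	default:
		// Round the CPU share first, then take the remainder as the GPU share.
		cpu := math.Round(float64(size-sizeVRAM) / float64(size) * 100)
		return int(100 - cpu), true
	}
}

func main() {
	for _, vram := range []int64{0, 333, 750, 1000} {
		p, ok := gpuPercent(1000, vram)
		fmt.Println(vram, p, ok)
	}
}
```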
func getTimeouts(t *testing.T) (soft time.Duration, hard time.Duration) {
deadline, hasDeadline := t.Deadline()
if !hasDeadline {


@@ -19,9 +19,16 @@ type shiftFn func(ctx ml.Context, layer int, key, shift ml.Tensor) (ml.Tensor, e
// The tensors are of shape embed dim, kv heads, batch size
// The mask is of shape history size, batch size
type Causal struct {
DType ml.DType
windowSize int32
chunkSize int32
DType ml.DType
// swaWindowSize is the number of tokens that will be included in the mask
// during attention operations. swaMemorySize is the number of tokens that
// will be retained in memory for partial prefix caching. Set to math.MaxInt32
// for unlimited or if sliding window attention is not being used.
swaWindowSize int32
swaMemorySize int32
chunkSize int32
opts CausalOptions
@@ -33,11 +40,6 @@ type Causal struct {
// ** current forward pass **
// curReserve indicates that this forward pass is only for
// memory reservation and we should not update our metadata
// based on it.
curReserve bool
// the active layer for Get and Put
curLayer int
@@ -88,32 +90,41 @@ type cellRange struct {
func NewCausalCache(shift shiftFn) *Causal {
return &Causal{
windowSize: math.MaxInt32,
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
}
}
func NewSWACache(windowSize int32, shift shiftFn) *Causal {
return &Causal{
windowSize: windowSize,
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
swaWindowSize: windowSize,
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
}
}
func NewSWAMemCache(windowSize int32, memorySize int32, shift shiftFn) *Causal {
return &Causal{
swaWindowSize: windowSize,
swaMemorySize: memorySize,
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
}
}
func NewChunkedAttentionCache(chunkSize int32, shift shiftFn) *Causal {
return &Causal{
windowSize: math.MaxInt32,
chunkSize: chunkSize,
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
chunkSize: chunkSize,
shiftFn: shift,
ctxs: make(map[int]ml.Context),
keys: make(map[int]ml.Tensor),
values: make(map[int]ml.Tensor),
}
}
@@ -138,11 +149,33 @@ func (c *Causal) Init(backend ml.Backend, dtype ml.DType, maxSequences, capacity
c.config.MaskDType = ml.DTypeF32
}
if c.swaWindowSize == 0 {
c.swaWindowSize = math.MaxInt32
}
if c.swaMemorySize == 0 {
c.swaMemorySize = c.swaWindowSize
}
// We will allocate space in the cache for the stop token, which won't be part of a follow on
// sequence, so allocate an extra token of storage to ensure that we can jump back without
// causing a cache break. As an optimization, only do this when we have parallel sequences
// because the extra token will live in the batch buffer and won't get overwritten if we
// only have a single sequence.
if c.swaMemorySize != math.MaxInt32 && maxSequences > 1 {
c.swaMemorySize = max(c.swaMemorySize, c.swaWindowSize+1)
}
if int(c.swaMemorySize) >= capacity {
c.swaMemorySize = math.MaxInt32
}
if c.swaMemorySize < c.swaWindowSize {
panic(fmt.Errorf("sliding window memory (%v) must be at least as large as the window (%v)", c.swaMemorySize, c.swaWindowSize))
}
var cacheSize int
if c.windowSize == math.MaxInt32 || capacity < int(c.windowSize) {
if c.swaMemorySize == math.MaxInt32 {
cacheSize = maxSequences * capacity
} else {
cacheSize = (maxSequences * int(c.windowSize)) + maxBatch
cacheSize = (maxSequences * int(c.swaMemorySize)) + maxBatch
}
cacheSize = roundUp(cacheSize, c.config.CachePadding)
c.cells = make([]cacheCell, cacheSize)
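
Review note: the sizing logic in this hunk can be followed step by step with a small standalone function. This is a sketch, not the cache implementation: `cacheCells` and its parameters are hypothetical, `roundUp` is assumed to round up to a multiple of the padding (with padding of at least 1), and the panic for a memory size smaller than the window is left out.

```go
package main

import (
	"fmt"
	"math"
)

// roundUp rounds n up to the next multiple of pad (assumed behaviour, pad >= 1).
func roundUp(n, pad int) int {
	return (n + pad - 1) / pad * pad
}

// cacheCells mirrors the defaulting and sizing steps from Init in the hunk
// above. Zero means "unset" for both window and memory.
func cacheCells(window, memory int32, maxSequences, capacity, maxBatch, padding int) int {
	if window == 0 {
		window = math.MaxInt32 // no sliding window
	}
	if memory == 0 {
		memory = window // retain exactly the window by default
	}
	// Keep one extra token per sequence when running sequences in parallel.
	if memory != math.MaxInt32 && maxSequences > 1 && memory < window+1 {
		memory = window + 1
	}
	// Retaining at least the whole context is the same as unlimited memory.
	if int(memory) >= capacity {
		memory = math.MaxInt32
	}
	var size int
	if memory == math.MaxInt32 {
		size = maxSequences * capacity
	} else {
		size = maxSequences*int(memory) + maxBatch
	}
	return roundUp(size, padding)
}

func main() {
	fmt.Println(cacheCells(1, 3, 1, 16, 16, 1)) // window 1, memory 3, single sequence
	fmt.Println(cacheCells(1, 0, 2, 16, 2, 1))  // memory defaults to the window, plus one extra token
	fmt.Println(cacheCells(0, 0, 2, 16, 2, 1))  // no sliding window at all
}
```

The practical effect is that a sliding-window cache with a small memory size needs far fewer cells than `maxSequences * capacity`, while a memory size at or above the context length falls back to the full-size layout.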
@@ -168,13 +201,12 @@ func (c *Causal) Close() {
}
func (c *Causal) StartForward(ctx ml.Context, batch input.Batch, reserve bool) error {
c.curReserve = reserve
c.curBatchSize = len(batch.Positions)
c.curSequences = batch.Sequences
c.curPositions = batch.Positions
c.opts.Except = nil
if !c.curReserve {
if !reserve {
c.updateSlidingWindow()
var err error
@@ -187,7 +219,6 @@ func (c *Causal) StartForward(ctx ml.Context, batch input.Batch, reserve bool) e
return err
}
c.curCellRange = newRange()
for i, pos := range batch.Positions {
seq := batch.Sequences[i]
@@ -198,19 +229,12 @@ func (c *Causal) StartForward(ctx ml.Context, batch input.Batch, reserve bool) e
seqRange = newRange()
}
if c.curLoc+i > seqRange.max {
seqRange.max = c.curLoc + i
}
if seqRange.max > c.curCellRange.max {
c.curCellRange.max = seqRange.max
}
seqRange.min = min(seqRange.min, c.curLoc+i)
c.curCellRange.min = min(c.curCellRange.min, c.curLoc+i)
seqRange.max = max(seqRange.max, c.curLoc+i)
c.curCellRange.max = max(c.curCellRange.max, c.curLoc+i)
if c.curLoc+i < seqRange.min {
seqRange.min = c.curLoc + i
}
if seqRange.min < c.curCellRange.min {
c.curCellRange.min = seqRange.min
}
c.cellRanges[seq] = seqRange
}
} else {
@@ -252,27 +276,57 @@ func (c *Causal) findStartLoc() (int, error) {
}
func (c *Causal) updateSlidingWindow() {
if c.windowSize == math.MaxInt32 {
c.curCellRange = newRange()
if c.swaMemorySize == math.MaxInt32 {
for _, seq := range c.curSequences {
if seqRange, ok := c.cellRanges[seq]; ok {
c.curCellRange.min = min(c.curCellRange.min, seqRange.min)
c.curCellRange.max = max(c.curCellRange.max, seqRange.max)
}
}
return
}
type lowestPosition struct {
pos int32
curBatch bool
}
// create a map of unique sequences to the lowest position in that sequence
lowestPos := make(map[int]int32)
lowestPos := make(map[int]lowestPosition)
for i := range c.curPositions {
seq := c.curSequences[i]
pos, ok := lowestPos[seq]
lowest, ok := lowestPos[seq]
if !ok {
pos = c.curPositions[i]
} else if c.curPositions[i] < pos {
pos = c.curPositions[i]
lowest = lowestPosition{pos: c.curPositions[i], curBatch: true}
} else if c.curPositions[i] < lowest.pos {
lowest.pos = c.curPositions[i]
}
lowestPos[seq] = pos
lowestPos[seq] = lowest
}
// for any sequences that are not part of this batch, clean up any tokens
// that are no longer needed after the processing of the previous
// batch
for seq, seqRange := range c.cellRanges {
if _, ok := lowestPos[seq]; !ok {
var last int32
for i := seqRange.min; i <= seqRange.max; i++ {
if slices.Contains(c.cells[i].sequences, seq) {
last = max(last, c.cells[i].pos)
}
}
lowestPos[seq] = lowestPosition{pos: last + 1, curBatch: false}
}
}
// delete any entries that are beyond the window of the oldest position in the sequence
for seq, pos := range lowestPos {
for seq, lowest := range lowestPos {
oldRange, ok := c.cellRanges[seq]
if !ok {
continue
@@ -282,12 +336,16 @@ func (c *Causal) updateSlidingWindow() {
for i := oldRange.min; i <= oldRange.max; i++ {
if slices.Contains(c.cells[i].sequences, seq) {
if c.cells[i].pos < pos-c.windowSize {
if c.cells[i].pos < lowest.pos-c.swaMemorySize {
c.cells[i].sequences = slices.DeleteFunc(c.cells[i].sequences, func(s int) bool { return s == seq })
} else {
newRange.min = min(newRange.min, i)
newRange.max = max(newRange.max, i)
}
if lowest.curBatch && c.cells[i].pos >= lowest.pos-c.swaWindowSize {
c.curCellRange.min = min(c.curCellRange.min, i)
c.curCellRange.max = max(c.curCellRange.max, i)
}
}
}
@@ -315,10 +373,6 @@ func (c *Causal) buildMask(ctx ml.Context) ml.Tensor {
length := c.curCellRange.max - c.curCellRange.min + 1
if c.curReserve {
return ctx.Input().Empty(c.config.MaskDType, length, batchSize)
}
mask := make([]float32, batchSize*length)
for i := range c.curBatchSize {
@@ -327,7 +381,7 @@ func (c *Causal) buildMask(ctx ml.Context) ml.Tensor {
if !slices.Contains(c.cells[j].sequences, c.curSequences[i]) ||
(enabled && c.cells[j].pos > c.curPositions[i]) ||
c.chunkSize > 0 && c.cells[j].pos < c.curPositions[i]-c.curPositions[i]%c.chunkSize ||
c.cells[j].pos < c.curPositions[i]-c.windowSize {
c.cells[j].pos < c.curPositions[i]-c.swaWindowSize {
mask[i*length+(j-c.curCellRange.min)] = float32(math.Inf(-1))
}
}
@@ -342,9 +396,7 @@ func (c *Causal) buildMask(ctx ml.Context) ml.Tensor {
maskTensor := ctx.Input().FromFloatSlice(mask, length, batchSize)
if c.config.MaskDType != ml.DTypeF32 {
out := ctx.Input().Empty(c.config.MaskDType, maskTensor.Shape()...)
ctx.Forward(maskTensor.Copy(ctx, out))
maskTensor = out
maskTensor = maskTensor.Cast(ctx, c.config.MaskDType)
}
return maskTensor
@@ -485,6 +537,8 @@ func (c *Causal) defrag() {
c.cellRanges[seq] = seqRange
}
c.updateSlidingWindow()
}
func (c *Causal) SetLayer(layer int) {
@@ -610,7 +664,7 @@ func (c *Causal) CopyPrefix(srcSeq, dstSeq int, len int32) {
}
func (c *Causal) CanResume(seq int, pos int32) bool {
if c.windowSize == math.MaxInt32 {
if c.swaMemorySize == math.MaxInt32 {
return true
}
@@ -621,9 +675,11 @@ func (c *Causal) CanResume(seq int, pos int32) bool {
// for sliding window, check that the window of the new sequence is contained in
// the window of what we are storing
var first int32 = math.MaxInt32
var last int32 = -1
for i := seqRange.min; i <= seqRange.max; i++ {
if slices.Contains(c.cells[i].sequences, seq) {
first = min(first, c.cells[i].pos)
last = max(last, c.cells[i].pos)
}
}
@@ -632,10 +688,8 @@ func (c *Causal) CanResume(seq int, pos int32) bool {
return false
}
lastWindowStart := max(0, last-c.windowSize)
posWindowStart := max(0, pos-c.windowSize)
return posWindowStart >= lastWindowStart
posWindowStart := max(0, pos-c.swaWindowSize)
return posWindowStart >= first && pos <= last+1
}
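
Review note: the new return condition in `CanResume` can be checked in isolation. The sketch below assumes `first` and `last` are the lowest and highest positions still stored for the sequence; `canResume` is a hypothetical free function, and the early return for a sequence with no stored cells is omitted.

```go
package main

import "fmt"

// canResume reports whether generation can restart at pos: the attention
// window for pos must begin at or after the oldest retained position, and pos
// may be at most one past the newest retained position.
func canResume(first, last, pos, window int32) bool {
	windowStart := pos - window
	if windowStart < 0 {
		windowStart = 0
	}
	return windowStart >= first && pos <= last+1
}

func main() {
	// Window 4 with positions 2..7 retained, the state that
	// TestCanResumeSWAMem reaches with a memory size of 5.
	for pos := int32(0); pos <= 9; pos++ {
		fmt.Println(pos, canResume(2, 7, pos, 4))
	}
}
```

With a memory size equal to the window, only the newest position qualifies; each extra retained token moves `first` back by one and makes one more position resumable.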
func (c *Causal) shift(seq int, beginIndex, offset int32) error {


@@ -60,6 +60,8 @@ func TestSWA(t *testing.T) {
cache.Init(backend, ml.DTypeF16, 1, 16, 16)
x := float32(math.Inf(-1))
tests := []testCase{
{
name: "FirstBatch",
@@ -69,7 +71,12 @@ func TestSWA(t *testing.T) {
pos: []int32{0, 1, 2, 3},
expected: []float32{1, 2, 3, 4},
expectedShape: []int{1, 1, 4},
expectedMask: []float32{0, float32(math.Inf(-1)), float32(math.Inf(-1)), float32(math.Inf(-1)), 0, 0, float32(math.Inf(-1)), float32(math.Inf(-1)), float32(math.Inf(-1)), 0, 0, float32(math.Inf(-1)), float32(math.Inf(-1)), float32(math.Inf(-1)), 0, 0},
expectedMask: []float32{
0, x, x, x,
0, 0, x, x,
x, 0, 0, x,
x, x, 0, 0,
},
},
{
name: "SecondBatch",
@@ -79,7 +86,133 @@ func TestSWA(t *testing.T) {
pos: []int32{4, 5},
expected: []float32{5, 6, 3, 4},
expectedShape: []int{1, 1, 4},
expectedMask: []float32{0, float32(math.Inf(-1)), float32(math.Inf(-1)), 0, 0, 0, float32(math.Inf(-1)), float32(math.Inf(-1))},
expectedMask: []float32{
0, x, x, 0,
0, 0, x, x,
},
},
}
testCache(t, backend, cache, tests)
}
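
Review note: the expected masks in `TestSWA` follow from two rules, causal ordering and the window bound. The sketch below reproduces them under simplifying assumptions: `swaMask` is a hypothetical function, every cached cell belongs to the querying sequence, causal masking is enabled, and chunked attention is not in use.

```go
package main

import (
	"fmt"
	"math"
)

// swaMask builds a row-major mask with one row per query position and one
// column per cached position. A cell is masked with -Inf when it lies in the
// future of the query or further back than the sliding window allows.
func swaMask(queries, cached []int32, window int32) []float32 {
	negInf := float32(math.Inf(-1))
	mask := make([]float32, len(queries)*len(cached))
	for i, q := range queries {
		for j, k := range cached {
			if k > q || k < q-window {
				mask[i*len(cached)+j] = negInf
			}
		}
	}
	return mask
}

func main() {
	// First batch: positions 0..3 attend to themselves with a window of 1.
	fmt.Println(swaMask([]int32{0, 1, 2, 3}, []int32{0, 1, 2, 3}, 1))
	// Second batch: the cells now hold positions 4, 5, 2, 3 in that order.
	fmt.Println(swaMask([]int32{4, 5}, []int32{4, 5, 2, 3}, 1))
}
```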
func TestSWASeparateBatches(t *testing.T) {
backend := &testBackend{}
cache := NewSWACache(1, nil)
defer cache.Close()
cache.Init(backend, ml.DTypeF16, 2, 16, 2)
x := float32(math.Inf(-1))
tests := []testCase{
{
name: "First seq 0",
in: []float32{1, 2},
inShape: []int{1, 1, 2},
seqs: []int{0, 0},
pos: []int32{0, 1},
expected: []float32{1, 2},
expectedShape: []int{1, 1, 2},
expectedMask: []float32{
0, x,
0, 0,
},
},
{
name: "Second seq 0",
in: []float32{3, 4},
inShape: []int{1, 1, 2},
seqs: []int{0, 0},
pos: []int32{2, 3},
expected: []float32{2, 3, 4},
expectedShape: []int{1, 1, 3},
expectedMask: []float32{
0, 0, x,
x, 0, 0,
},
},
{
name: "First seq 1",
in: []float32{5, 6},
inShape: []int{1, 1, 2},
seqs: []int{1, 1},
pos: []int32{0, 1},
expected: []float32{5, 6},
expectedShape: []int{1, 1, 2},
expectedMask: []float32{
0, x,
0, 0,
},
},
{
name: "Second seq 1",
in: []float32{7, 8},
inShape: []int{1, 1, 2},
seqs: []int{1, 1},
pos: []int32{2, 3},
expected: []float32{6, 3, 4, 7, 8},
expectedShape: []int{1, 1, 5},
expectedMask: []float32{
0, x, x, 0, x,
x, x, x, 0, 0,
},
},
{
name: "Third seq 0",
in: []float32{9, 10},
inShape: []int{1, 1, 2},
seqs: []int{0, 0},
pos: []int32{4, 5},
expected: []float32{9, 10, 3, 4},
expectedShape: []int{1, 1, 4},
expectedMask: []float32{
0, x, x, 0,
0, 0, x, x,
},
},
}
testCache(t, backend, cache, tests)
}
func TestSWAMem(t *testing.T) {
backend := &testBackend{}
cache := NewSWAMemCache(1, 3, nil)
defer cache.Close()
cache.Init(backend, ml.DTypeF16, 1, 16, 16)
x := float32(math.Inf(-1))
tests := []testCase{
{
name: "FirstBatch",
in: []float32{1, 2, 3, 4},
inShape: []int{1, 1, 4},
seqs: []int{0, 0, 0, 0},
pos: []int32{0, 1, 2, 3},
expected: []float32{1, 2, 3, 4},
expectedShape: []int{1, 1, 4},
expectedMask: []float32{
0, x, x, x,
0, 0, x, x,
x, 0, 0, x,
x, x, 0, 0,
},
},
{
name: "SecondBatch",
in: []float32{5, 6},
inShape: []int{1, 1, 2},
seqs: []int{0, 0},
pos: []int32{4, 5},
expected: []float32{4, 5, 6},
expectedShape: []int{1, 1, 3},
expectedMask: []float32{
0, 0, x,
x, 0, 0,
},
},
}
@@ -378,15 +511,15 @@ func TestCanResume(t *testing.T) {
defer context.Close()
err := cache.StartForward(context, input.Batch{
Positions: []int32{0, 1, 2, 3},
Sequences: []int{0, 0, 0, 0},
Positions: []int32{0, 1, 2, 3, 4},
Sequences: []int{0, 0, 0, 0, 0},
}, false)
if err != nil {
t.Fatalf("StartForward failed: %v", err)
}
cache.SetLayer(0)
tensor := context.FromFloatSlice([]float32{1, 2, 3, 4}, 1, 1, 4)
tensor := context.FromFloatSlice([]float32{1, 2, 3, 4, 5}, 1, 1, 5)
cache.Put(context, tensor, tensor)
// with window size 4, nothing has slid out of the window yet
@@ -402,18 +535,21 @@ func TestCanResume(t *testing.T) {
if !cache.CanResume(0, 3) {
t.Errorf("CanResume(0, 3) = false, want true (latest position)")
}
if !cache.CanResume(0, 4) {
t.Errorf("CanResume(0, 4) = false, want true (latest position)")
}
// shift window by adding position 4
// shift window by adding position 5
err = cache.StartForward(context, input.Batch{
Positions: []int32{4, 5},
Sequences: []int{0, 0},
Positions: []int32{5},
Sequences: []int{0},
}, false)
if err != nil {
t.Fatalf("StartForward failed: %v", err)
}
cache.SetLayer(0)
tensor = context.FromFloatSlice([]float32{5, 6}, 1, 1, 2)
tensor = context.FromFloatSlice([]float32{6}, 1, 1, 1)
cache.Put(context, tensor, tensor)
// only the latest position has overlapping windows
@@ -437,6 +573,70 @@ func TestCanResume(t *testing.T) {
}
}
func TestCanResumeSWAMem(t *testing.T) {
backend := &testBackend{}
windowSize := int32(4)
memSize := int32(5)
cache := NewSWAMemCache(windowSize, memSize, nil)
defer cache.Close()
cache.Init(backend, ml.DTypeF16, 1, 16, 16)
context := backend.NewContext()
defer context.Close()
err := cache.StartForward(context, input.Batch{
Positions: []int32{0, 1, 2, 3, 4, 5, 6},
Sequences: []int{0, 0, 0, 0, 0, 0, 0},
}, false)
if err != nil {
t.Fatalf("StartForward failed: %v", err)
}
cache.SetLayer(0)
tensor := context.FromFloatSlice([]float32{1, 2, 3, 4, 5, 6, 7}, 1, 1, 7)
cache.Put(context, tensor, tensor)
// shift window by adding position 7
err = cache.StartForward(context, input.Batch{
Positions: []int32{7},
Sequences: []int{0},
}, false)
if err != nil {
t.Fatalf("StartForward failed: %v", err)
}
cache.SetLayer(0)
tensor = context.FromFloatSlice([]float32{8}, 1, 1, 1)
cache.Put(context, tensor, tensor)
// with one extra token of memory, positions 6 and 7 can both be resumed
if cache.CanResume(0, 0) {
t.Errorf("after shift: CanResume(0, 0) = true, want false (outside window)")
}
if cache.CanResume(0, 1) {
t.Errorf("after shift: CanResume(0, 1) = true, want false (outside window)")
}
if cache.CanResume(0, 2) {
t.Errorf("after shift: CanResume(0, 2) = true, want false (outside window)")
}
if cache.CanResume(0, 3) {
t.Errorf("after shift: CanResume(0, 3) = true, want false (outside window)")
}
if cache.CanResume(0, 4) {
t.Errorf("after shift: CanResume(0, 4) = true, want false (outside window)")
}
if cache.CanResume(0, 5) {
t.Errorf("after shift: CanResume(0, 5) = true, want false (outside window)")
}
if !cache.CanResume(0, 6) {
t.Errorf("after shift: CanResume(0, 6) = false, want true (inside window)")
}
if !cache.CanResume(0, 7) {
t.Errorf("after shift: CanResume(0, 7) = false, want true (latest position)")
}
}
type testBackend struct {
ml.Backend
}

llama/build-info.cpp generated vendored

@@ -1,4 +1,4 @@
int LLAMA_BUILD_NUMBER = 0;
char const *LLAMA_COMMIT = "de4c07f93783a1a96456a44dc16b9db538ee1618";
char const *LLAMA_COMMIT = "7049736b2dd9011bf819e298b844ebbc4b5afdc9";
char const *LLAMA_COMPILER = "";
char const *LLAMA_BUILD_TARGET = "";


@@ -1,23 +1,32 @@
protect **/*.go
include common/
include common/base64.*
include common/common.*
include common/json-schema-to-grammar.*
include common/json.*
include common/log.*
include common/sampling.*
include common/stb_image.*
include include/
include include/llama.*
include include/llama-*.*
include tools/
include tools/mtmd/
include tools/mtmd/clip.*
include tools/mtmd/clip-impl.*
include tools/mtmd/llava.*
include src/
include src/llama.*
include src/llama-*.*
include src/unicode-data.*
include src/unicode.*
exclude *
protect .rsync-filter
protect *.go
include /common/
include /common/base64.*
include /common/common.*
include /common/json-schema-to-grammar.*
include /common/json.*
include /common/log.*
include /common/sampling.*
include /include/
include /include/llama.*
include /include/llama-*.*
include /tools/
include /tools/mtmd/
include /tools/mtmd/*.h
include /tools/mtmd/clip.cpp
include /tools/mtmd/mtmd.cpp
include /tools/mtmd/mtmd-audio.cpp
include /tools/mtmd/mtmd-helper.cpp
include /src/
include /src/llama.*
include /src/llama-*.*
include /src/unicode-data.*
include /src/unicode.*
include /vendor/
include /vendor/miniaudio/
include /vendor/miniaudio/*.h
include /vendor/nlohmann/
include /vendor/nlohmann/*.hpp
include /vendor/stb/
include /vendor/stb/*.h
hide *


@@ -14,6 +14,7 @@
#include <climits>
#include <cmath>
#include <codecvt>
#include <chrono>
#include <cstdarg>
#include <cstring>
#include <ctime>
@@ -41,6 +42,7 @@
#endif
#include <locale>
#include <windows.h>
#include <string.h>
#include <fcntl.h>
#include <io.h>
#else
@@ -49,6 +51,11 @@
#include <unistd.h>
#endif
#if defined(__linux__)
#include <sys/types.h>
#include <pwd.h>
#endif
#if defined(_MSC_VER)
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
@@ -203,6 +210,7 @@ bool set_process_priority(enum ggml_sched_priority prio) {
DWORD p = NORMAL_PRIORITY_CLASS;
switch (prio) {
case GGML_SCHED_PRIO_LOW: p = BELOW_NORMAL_PRIORITY_CLASS; break;
case GGML_SCHED_PRIO_NORMAL: p = NORMAL_PRIORITY_CLASS; break;
case GGML_SCHED_PRIO_MEDIUM: p = ABOVE_NORMAL_PRIORITY_CLASS; break;
case GGML_SCHED_PRIO_HIGH: p = HIGH_PRIORITY_CLASS; break;
@@ -228,6 +236,7 @@ bool set_process_priority(enum ggml_sched_priority prio) {
int p = 0;
switch (prio) {
case GGML_SCHED_PRIO_LOW: p = 5; break;
case GGML_SCHED_PRIO_NORMAL: p = 0; break;
case GGML_SCHED_PRIO_MEDIUM: p = -5; break;
case GGML_SCHED_PRIO_HIGH: p = -10; break;
@@ -443,9 +452,37 @@ void string_replace_all(std::string & s, const std::string & search, const std::
s = std::move(builder);
}
bool string_ends_with(const std::string_view & str, const std::string_view & suffix) {
return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
}
bool string_remove_suffix(std::string & str, const std::string_view & suffix) {
bool has_suffix = string_ends_with(str, suffix);
if (has_suffix) {
str = str.substr(0, str.size() - suffix.size());
}
return has_suffix;
}
size_t string_find_partial_stop(const std::string_view & str, const std::string_view & stop) {
if (!str.empty() && !stop.empty()) {
const char text_last_char = str.back();
for (int64_t char_index = stop.size() - 1; char_index >= 0; char_index--) {
if (stop[char_index] == text_last_char) {
const auto current_partial = stop.substr(0, char_index + 1);
if (string_ends_with(str, current_partial)) {
return str.size() - char_index - 1;
}
}
}
}
return std::string::npos;
}
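
Review note: `string_find_partial_stop` looks for the longest prefix of a stop string that the text currently ends with, so a streaming caller can hold those bytes back until the stop string either completes or is ruled out. The Go port below is for illustration only: `findPartialStop` is a hypothetical name, it works byte-wise like the C++ version, and it returns -1 where the original returns `std::string::npos`.

```go
package main

import (
	"fmt"
	"strings"
)

// findPartialStop returns the byte index in text where a prefix of stop
// begins, provided text ends with that prefix, or -1 if there is none.
// Longer prefixes are tried first.
func findPartialStop(text, stop string) int {
	if text == "" || stop == "" {
		return -1
	}
	last := text[len(text)-1]
	for i := len(stop) - 1; i >= 0; i-- {
		if stop[i] != last {
			continue
		}
		if strings.HasSuffix(text, stop[:i+1]) {
			return len(text) - i - 1
		}
	}
	return -1
}

func main() {
	fmt.Println(findPartialStop("Hello </s", "</stop>"))
	fmt.Println(findPartialStop("Hello", "</stop>"))
}
```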
std::string regex_escape(const std::string & s) {
static const std::regex special_chars("[.^$|()*+?\\[\\]{}\\\\]");
return std::regex_replace(s, special_chars, "\\$0");
return std::regex_replace(s, special_chars, "\\$&");
}
std::string string_join(const std::vector<std::string> & values, const std::string & separator) {
@@ -527,13 +564,6 @@ std::string string_from(const struct llama_context * ctx, const std::vector<llam
auto detokenized = common_token_to_piece(ctx, token);
detokenized.erase(
std::remove_if(
detokenized.begin(),
detokenized.end(),
[](const unsigned char c) { return !std::isprint(c); }),
detokenized.end());
buf << "'" << detokenized << "'"
<< ":" << std::to_string(token);
}
@@ -558,13 +588,6 @@ std::string string_from(const struct llama_context * ctx, const struct llama_bat
auto detokenized = common_token_to_piece(ctx, batch.token[i]);
detokenized.erase(
std::remove_if(
detokenized.begin(),
detokenized.end(),
[](const unsigned char c) { return !std::isprint(c); }),
detokenized.end());
buf << "\n" << std::to_string(i)
<< ", token '" << detokenized << "'"
<< ", pos " << std::to_string(batch.pos[i])
@@ -685,11 +708,17 @@ bool fs_validate_filename(const std::string & filename) {
// disable C++17 deprecation warning for std::codecvt_utf8
# pragma clang diagnostic push
# pragma clang diagnostic ignored "-Wdeprecated-declarations"
#elif defined(__GNUC__)
# pragma GCC diagnostic push
# pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#endif
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
#if defined(__clang__)
# pragma clang diagnostic pop
#elif defined(__GNUC__)
# pragma GCC diagnostic pop
#endif
filename_utf32 = converter.from_bytes(filename);
@@ -746,6 +775,9 @@ bool fs_validate_filename(const std::string & filename) {
return true;
}
#include <iostream>
// returns true if successful, false otherwise
bool fs_create_directory_with_parents(const std::string & path) {
#ifdef _WIN32
@@ -763,9 +795,16 @@ bool fs_create_directory_with_parents(const std::string & path) {
// process path from front to back, procedurally creating directories
while ((pos_slash = path.find('\\', pos_slash)) != std::string::npos) {
const std::wstring subpath = wpath.substr(0, pos_slash);
const wchar_t * test = subpath.c_str();
const bool success = CreateDirectoryW(test, NULL);
pos_slash += 1;
// skip the drive letter, in some systems it can return an access denied error
if (subpath.length() == 2 && subpath[1] == ':') {
continue;
}
const bool success = CreateDirectoryW(subpath.c_str(), NULL);
if (!success) {
const DWORD error = GetLastError();
@@ -779,8 +818,6 @@ bool fs_create_directory_with_parents(const std::string & path) {
return false;
}
}
pos_slash += 1;
}
return true;
@@ -830,11 +867,23 @@ std::string fs_get_cache_directory() {
if (getenv("LLAMA_CACHE")) {
cache_directory = std::getenv("LLAMA_CACHE");
} else {
#if defined(__linux__) || defined(__FreeBSD__) || defined(_AIX)
#if defined(__linux__) || defined(__FreeBSD__) || defined(_AIX) || defined(__OpenBSD__)
if (std::getenv("XDG_CACHE_HOME")) {
cache_directory = std::getenv("XDG_CACHE_HOME");
} else {
} else if (std::getenv("HOME")) {
cache_directory = std::getenv("HOME") + std::string("/.cache/");
} else {
#if defined(__linux__)
/* no $HOME is defined, fallback to getpwuid */
struct passwd *pw = getpwuid(getuid());
if ((!pw) || (!pw->pw_dir)) {
throw std::runtime_error("Failed to find $HOME directory");
}
cache_directory = std::string(pw->pw_dir) + std::string("/.cache/");
#else /* defined(__linux__) */
throw std::runtime_error("Failed to find $HOME directory");
#endif /* defined(__linux__) */
}
#elif defined(__APPLE__)
cache_directory = std::getenv("HOME") + std::string("/Library/Caches/");
@@ -870,47 +919,24 @@ struct common_init_result common_init_from_params(common_params & params) {
llama_model * model = llama_model_load_from_file(params.model.path.c_str(), mparams);
if (model == NULL) {
LOG_ERR("%s: failed to load model '%s'\n", __func__, params.model.path.c_str());
LOG_ERR("%s: failed to load model '%s', try reducing --n-gpu-layers if you're running out of VRAM\n",
__func__, params.model.path.c_str());
return iparams;
}
const llama_vocab * vocab = llama_model_get_vocab(model);
if (params.reranking) {
bool ok = true;
if (llama_vocab_bos(vocab) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: vocab does not have a BOS token, reranking will not work\n", __func__);
ok = false;
}
if (llama_vocab_eos(vocab) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: vocab does not have an EOS token, reranking will not work\n", __func__);
ok = false;
}
if (llama_vocab_sep(vocab) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: vocab does not have a SEP token, reranking will not work\n", __func__);
ok = false;
}
if (!ok) {
llama_model_free(model);
return iparams;
}
}
auto cparams = common_context_params_to_llama(params);
llama_context * lctx = llama_init_from_model(model, cparams);
if (lctx == NULL) {
LOG_ERR("%s: failed to create context with model '%s'\n", __func__, params.model.path.c_str());
LOG_ERR("%s: failed to create context with model '%s', try reducing --n-gpu-layers if you're running out of VRAM\n",
__func__, params.model.path.c_str());
llama_model_free(model);
return iparams;
}
if (params.ctx_shift && !llama_kv_self_can_shift(lctx)) {
if (params.ctx_shift && !llama_memory_can_shift(llama_get_memory(lctx))) {
LOG_WRN("%s: KV cache shifting is not supported for this context, disabling KV cache shifting\n", __func__);
params.ctx_shift = false;
}
@@ -942,6 +968,33 @@ struct common_init_result common_init_from_params(common_params & params) {
}
}
if (llama_pooling_type(lctx) == LLAMA_POOLING_TYPE_RANK) {
bool ok = true;
if (llama_vocab_bos(vocab) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: vocab does not have a BOS token, reranking will not work\n", __func__);
ok = false;
}
bool has_eos = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL;
bool has_sep = llama_vocab_sep(vocab) != LLAMA_TOKEN_NULL;
bool has_rerank_prompt = llama_model_chat_template(model, "rerank") != NULL;
if (!has_eos && !has_sep && !has_rerank_prompt) {
LOG_WRN("%s: warning: vocab does not have an EOS token, SEP token, or rerank prompt. Reranking will not work\n", __func__);
ok = false;
} else if (!has_eos) {
LOG_WRN("%s: warning: vocab does not have an EOS token, using SEP token as fallback\n", __func__);
}
if (!ok) {
llama_free(lctx);
llama_model_free(model);
return iparams;
}
}
// load and optionally apply lora adapters
for (auto & la : params.lora_adapters) {
llama_adapter_lora_ptr lora;
@@ -953,7 +1006,12 @@ struct common_init_result common_init_from_params(common_params & params) {
return iparams;
}
char buf[1024];
la.ptr = lora.get();
llama_adapter_meta_val_str(la.ptr, "adapter.lora.task_name", buf, sizeof(buf));
la.task_name = buf;
llama_adapter_meta_val_str(la.ptr, "adapter.lora.prompt_prefix", buf, sizeof(buf));
la.prompt_prefix = buf;
iparams.lora.emplace_back(std::move(lora)); // copy to list of loaded adapters
}
@@ -966,15 +1024,21 @@ struct common_init_result common_init_from_params(common_params & params) {
params.sampling.ignore_eos = false;
}
if (params.sampling.ignore_eos) {
for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
if (llama_vocab_is_eog(vocab, i)) {
LOG_INF("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(lctx, i).c_str(), -INFINITY);
params.sampling.logit_bias.push_back({i, -INFINITY});
}
// initialize once
for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
if (llama_vocab_is_eog(vocab, i)) {
LOG_INF("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(lctx, i).c_str(), -INFINITY);
params.sampling.logit_bias_eog.push_back({i, -INFINITY});
}
}
if (params.sampling.ignore_eos) {
// add EOG biases to the active set of logit biases
params.sampling.logit_bias.insert(
params.sampling.logit_bias.end(),
params.sampling.logit_bias_eog.begin(), params.sampling.logit_bias_eog.end());
}
if (params.sampling.penalty_last_n == -1) {
LOG_INF("%s: setting penalty_last_n to ctx_size = %d\n", __func__, llama_n_ctx(lctx));
params.sampling.penalty_last_n = llama_n_ctx(lctx);
@@ -1017,7 +1081,7 @@ struct common_init_result common_init_from_params(common_params & params) {
if (llama_model_has_decoder(model)) {
llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
}
llama_kv_self_clear(lctx);
llama_memory_clear(llama_get_memory(lctx), true);
llama_synchronize(lctx);
llama_perf_context_reset(lctx);
llama_set_warmup(lctx, false);
@@ -1068,6 +1132,8 @@ struct llama_model_params common_model_params_to_llama(common_params & params) {
mparams.use_mmap = params.use_mmap;
mparams.use_mlock = params.use_mlock;
mparams.check_tensors = params.check_tensors;
mparams.use_extra_bufts = !params.no_extra_bufts;
mparams.no_host = params.no_host;
if (params.kv_overrides.empty()) {
mparams.kv_overrides = NULL;
@@ -1083,6 +1149,9 @@ struct llama_model_params common_model_params_to_llama(common_params & params) {
mparams.tensor_buft_overrides = params.tensor_buft_overrides.data();
}
mparams.progress_callback = params.load_progress_callback;
mparams.progress_callback_user_data = params.load_progress_callback_user_data;
return mparams;
}
@@ -1107,18 +1176,14 @@ struct llama_context_params common_context_params_to_llama(const common_params &
cparams.yarn_orig_ctx = params.yarn_orig_ctx;
cparams.pooling_type = params.pooling_type;
cparams.attention_type = params.attention_type;
cparams.defrag_thold = params.defrag_thold;
cparams.flash_attn_type = params.flash_attn_type;
cparams.cb_eval = params.cb_eval;
cparams.cb_eval_user_data = params.cb_eval_user_data;
cparams.offload_kqv = !params.no_kv_offload;
cparams.flash_attn = params.flash_attn;
cparams.no_perf = params.no_perf;
cparams.op_offload = !params.no_op_offload;
if (params.reranking) {
cparams.embeddings = true;
cparams.pooling_type = LLAMA_POOLING_TYPE_RANK;
}
cparams.swa_full = params.swa_full;
cparams.kv_unified = params.kv_unified;
cparams.type_k = params.cache_type_k;
cparams.type_v = params.cache_type_v;
@@ -1252,6 +1317,9 @@ std::vector<llama_token> common_tokenize(
int n_tokens = text.length() + 2 * add_special;
std::vector<llama_token> result(n_tokens);
n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
if (n_tokens == std::numeric_limits<int32_t>::min()) {
throw std::runtime_error("Tokenization failed: input text too large, tokenization result exceeds int32_t limit");
}
if (n_tokens < 0) {
result.resize(-n_tokens);
int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
@@ -1306,81 +1374,6 @@ std::string common_detokenize(const struct llama_vocab * vocab, const std::vecto
return text;
}
//
// KV cache utils
//
void common_kv_cache_dump_view(const llama_kv_cache_view & view, int row_size) {
static const char slot_chars[] = ".123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+";
printf("=== Dumping KV cache. total cells %d, max sequences per cell %d, populated cells %d, total tokens in cache %d, largest empty slot=%d @ %d",
view.n_cells, view.n_seq_max, view.used_cells, view.token_count, view.max_contiguous, view.max_contiguous_idx);
llama_kv_cache_view_cell * c_curr = view.cells;
llama_seq_id * cs_curr = view.cells_sequences;
for (int i = 0; i < view.n_cells; i++, c_curr++, cs_curr += view.n_seq_max) {
if (i % row_size == 0) {
printf("\n%5d: ", i);
}
int seq_count = 0;
for (int j = 0; j < view.n_seq_max; j++) {
if (cs_curr[j] >= 0) { seq_count++; }
}
putchar(slot_chars[std::min(sizeof(slot_chars) - 2, size_t(seq_count))]);
}
printf("\n=== Done dumping\n");
}
void common_kv_cache_dump_view_seqs(const llama_kv_cache_view & view, int row_size) {
static const char slot_chars[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
printf("=== Dumping KV cache. total cells %d, max sequences per cell %d, populated cells %d, total tokens in cache %d, largest empty slot=%d @ %d\n",
view.n_cells, view.n_seq_max, view.used_cells, view.token_count, view.max_contiguous, view.max_contiguous_idx);
std::unordered_map<llama_seq_id, size_t> seqs;
llama_kv_cache_view_cell * c_curr = view.cells;
llama_seq_id * cs_curr = view.cells_sequences;
for (int i = 0; i < view.n_cells; i++, c_curr++, cs_curr += view.n_seq_max) {
for (int j = 0; j < view.n_seq_max; j++) {
if (cs_curr[j] < 0) { continue; }
if (seqs.find(cs_curr[j]) == seqs.end()) {
if (seqs.size() + 1 >= sizeof(slot_chars)) { break; }
const size_t sz = seqs.size();
seqs[cs_curr[j]] = sz;
}
}
if (seqs.size() + 1 >= sizeof(slot_chars)) { break; }
}
printf("=== Sequence legend: ");
for (const auto & it : seqs) {
printf("%zu=%d, ", it.second, it.first);
}
printf("'+'=other sequence ids");
c_curr = view.cells;
cs_curr = view.cells_sequences;
for (int i = 0; i < view.n_cells; i++, c_curr++, cs_curr += view.n_seq_max) {
if (i % row_size == 0) {
printf("\n%5d: ", i);
}
for (int j = 0; j < view.n_seq_max; j++) {
if (cs_curr[j] >= 0) {
const auto & it = seqs.find(cs_curr[j]);
putchar(it != seqs.end() ? int(slot_chars[it->second]) : '+');
} else {
putchar('.');
}
}
putchar(' ');
}
printf("\n=== Done dumping\n");
}
//
// Embedding utils
//
@@ -1582,3 +1575,56 @@ ggml_opt_dataset_t common_opt_dataset_init(struct llama_context * ctx, const std
return result;
}
ggml_opt_optimizer_params common_opt_lr_pars(void * userdata) {
ggml_opt_optimizer_params result = ggml_opt_get_default_optimizer_params(nullptr);
const lr_opt & d = *(lr_opt *) userdata;
result.adamw.alpha = result.sgd.alpha = d.get_lr(d.epoch);
result.sgd.wd = result.adamw.wd = d.wd;
return result;
}
// TODO make all command line args case-insensitive
static inline bool eq_case_insensitive(char const* a, char const* b) {
return !
#if defined(_MSC_VER)
_stricmp
#else
strcasecmp
#endif // defined(_MSC_VER)
(a, b);
}
enum ggml_opt_optimizer_type common_opt_get_optimizer(const char * n) {
if (eq_case_insensitive("adamw", n)) {
return GGML_OPT_OPTIMIZER_TYPE_ADAMW;
}
if (eq_case_insensitive("sgd", n)) {
return GGML_OPT_OPTIMIZER_TYPE_SGD;
}
return GGML_OPT_OPTIMIZER_TYPE_COUNT;
}
// TODO simplify to use just log and exp
static float const k_log_2 = std::log(2.f);
void lr_opt::init() {
if (lr_min > 0 && lr_min < lr0) {
float nhalf = std::log(lr0 / lr_min) / k_log_2;
float e = epochs;
if (decay_epochs > 0 && decay_epochs < e) {
e = decay_epochs;
} else {
decay_epochs = e;
}
scale_epoch = nhalf / e;
}
}
float lr_opt::get_lr(float epoch) const {
float r = lr_min <= 0 ? lr0 :
epoch >= decay_epochs ? lr_min :
lr0 * std::pow(0.5f, epoch * scale_epoch);
LOG_INF("epoch %.2g lr=%.2g\n", epoch, r);
return r;
}
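Review note: the decay in `lr_opt::init()` / `lr_opt::get_lr()` above is a halving schedule. `scale_epoch` is the number of halvings per epoch, chosen so that `lr0` reaches `lr_min` exactly at `decay_epochs`. The sketch below restates that arithmetic in isolation. `lr_schedule` is a hypothetical stand-in and not the real `lr_opt`: it uses `std::log2` instead of `std::log(x) / k_log_2`, omits logging, and assumes `decay_epochs` has already been set to a positive value.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the lr_opt fields used by the schedule (not the real struct).
struct lr_schedule {
    float lr0;           // learning rate at epoch 0
    float lr_min;        // floor reached at decay_epochs (<= 0 disables decay)
    float decay_epochs;  // number of epochs over which lr0 decays to lr_min
    float scale_epoch;   // halvings per epoch, derived in init()

    void init() {
        assert(decay_epochs > 0.0f); // assumption of this sketch only
        scale_epoch = 0.0f;
        if (lr_min > 0.0f && lr_min < lr0) {
            // number of halvings needed to go from lr0 down to lr_min
            const float nhalf = std::log2(lr0 / lr_min);
            scale_epoch = nhalf / decay_epochs;
        }
    }

    float get_lr(float epoch) const {
        if (lr_min <= 0.0f)        return lr0;    // decay disabled
        if (epoch >= decay_epochs) return lr_min; // clamp at the floor
        return lr0 * std::pow(0.5f, epoch * scale_epoch);
    }
};
```

With `lr0 = 1e-3`, `lr_min = 1.25e-4` and `decay_epochs = 3`, the ratio is 8, so three halvings are spread over three epochs and the rate is halved once per epoch.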

@@ -1,6 +1,6 @@
package common
// #cgo CXXFLAGS: -std=c++11
// #cgo CPPFLAGS: -I${SRCDIR}/../include
// #cgo CXXFLAGS: -std=c++17
// #cgo CPPFLAGS: -I${SRCDIR}/../include -I${SRCDIR}/../vendor
// #cgo CPPFLAGS: -I${SRCDIR}/../../../ml/backend/ggml/ggml/include
import "C"

@@ -2,12 +2,17 @@
#pragma once
#include "llama-cpp.h"
#include <set>
#include <string>
#include <vector>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>
#include <map>
#include <sstream>
#include <cmath>
#include "ggml-opt.h"
#include "llama-cpp.h"
#ifdef _WIN32
#define DIRECTORY_SEPARATOR '\\'
@@ -29,6 +34,9 @@ struct common_adapter_lora_info {
std::string path;
float scale;
std::string task_name;
std::string prompt_prefix;
struct llama_adapter_lora * ptr;
};
@@ -75,10 +83,12 @@ enum llama_example {
LLAMA_EXAMPLE_SERVER,
LLAMA_EXAMPLE_CVECTOR_GENERATOR,
LLAMA_EXAMPLE_EXPORT_LORA,
LLAMA_EXAMPLE_LLAVA,
LLAMA_EXAMPLE_MTMD,
LLAMA_EXAMPLE_LOOKUP,
LLAMA_EXAMPLE_PARALLEL,
LLAMA_EXAMPLE_TTS,
LLAMA_EXAMPLE_DIFFUSION,
LLAMA_EXAMPLE_FINETUNE,
LLAMA_EXAMPLE_COUNT,
};
@@ -114,7 +124,7 @@ enum common_grammar_trigger_type {
COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN,
COMMON_GRAMMAR_TRIGGER_TYPE_WORD,
COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN,
COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_START,
COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL,
};
struct common_grammar_trigger {
@@ -175,17 +185,19 @@ struct common_params_sampling {
std::vector<common_grammar_trigger> grammar_triggers; // optional triggers (for lazy grammars)
std::set<llama_token> preserved_tokens;
std::vector<llama_logit_bias> logit_bias; // logit biases to apply
std::vector<llama_logit_bias> logit_bias; // logit biases to apply
std::vector<llama_logit_bias> logit_bias_eog; // pre-calculated logit biases for EOG tokens
// print the parameters into a string
std::string print() const;
};
struct common_params_model {
std::string path = ""; // model local path // NOLINT
std::string url = ""; // model url to download // NOLINT
std::string hf_repo = ""; // HF repo // NOLINT
std::string hf_file = ""; // HF file // NOLINT
std::string path = ""; // model local path // NOLINT
std::string url = ""; // model url to download // NOLINT
std::string hf_repo = ""; // HF repo // NOLINT
std::string hf_file = ""; // HF file // NOLINT
std::string docker_repo = ""; // Docker repo // NOLINT
};
struct common_params_speculative {
@@ -197,6 +209,11 @@ struct common_params_speculative {
int32_t n_gpu_layers = -1; // number of layers to store in VRAM for the draft model (-1 - use default)
float p_split = 0.1f; // speculative decoding split probability
float p_min = 0.75f; // minimum speculative decoding probability (greedy)
std::vector<std::pair<std::string, std::string>> replacements; // main to speculative model replacements
std::vector<llama_model_tensor_buft_override> tensor_buft_overrides;
ggml_type cache_type_k = GGML_TYPE_F16; // KV cache data type for the K
ggml_type cache_type_v = GGML_TYPE_F16; // KV cache data type for the V
struct cpu_params cpuparams;
struct cpu_params cpuparams_batch;
@@ -212,11 +229,50 @@ struct common_params_vocoder {
bool use_guide_tokens = false; // enable guide tokens to improve TTS accuracy // NOLINT
};
struct common_params_diffusion {
int32_t steps = 128;
bool visual_mode = false;
float eps = 0; // epsilon for timesteps
int32_t block_length = 0; // block length for generation
int32_t algorithm = 4; // default algorithm: low-confidence
float alg_temp = 0.0f; // algorithm temperature
float cfg_scale = 0; // classifier-free guidance scale
bool add_gumbel_noise = false; // add gumbel noise to the logits if temp > 0.0
};
// reasoning API response format (not to be confused as chat template's reasoning format)
enum common_reasoning_format {
COMMON_REASONING_FORMAT_NONE,
COMMON_REASONING_FORMAT_DEEPSEEK, // Extract thinking tag contents and return as `message.reasoning_content`
COMMON_REASONING_FORMAT_AUTO, // Same as deepseek, using `message.reasoning_content`
COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY, // Extract thinking tag contents and return as `message.reasoning_content`, or leave inline in <think> tags in stream mode
COMMON_REASONING_FORMAT_DEEPSEEK, // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
// do not extend this enum unless you absolutely have to
// in most cases, use COMMON_REASONING_FORMAT_AUTO
// see: https://github.com/ggml-org/llama.cpp/pull/15408
};
struct lr_opt {
float lr0 = 1e-5; // learning rate at first epoch
float lr_min = -1;
float decay_epochs = -1; // if >0, the learning rate starts at lr0 and decays to lr_min after this many epochs
float scale_epoch = 0;
float wd = 0;
unsigned epochs = 2;
unsigned epoch; // set by optimizer outer (epochs) loop
// learning rate decay - constant LR per epoch only for now
float get_lr(float e) const;
float get_lr() const { return get_lr(epoch); }
// must call after arg parse, before get_lr
void init();
};
struct ggml_opt_optimizer_params common_opt_lr_pars(void * userdata);
struct common_params {
int32_t n_predict = -1; // new tokens to predict
int32_t n_ctx = 4096; // context size
@@ -232,11 +288,10 @@ struct common_params {
float rope_freq_base = 0.0f; // RoPE base frequency
float rope_freq_scale = 0.0f; // RoPE frequency scaling factor
float yarn_ext_factor = -1.0f; // YaRN extrapolation mix factor
float yarn_attn_factor = 1.0f; // YaRN magnitude scaling factor
float yarn_beta_fast = 32.0f; // YaRN low correction dim
float yarn_beta_slow = 1.0f; // YaRN high correction dim
float yarn_attn_factor = -1.0f; // YaRN magnitude scaling factor
float yarn_beta_fast = -1.0f; // YaRN low correction dim
float yarn_beta_slow = -1.0f; // YaRN high correction dim
int32_t yarn_orig_ctx = 0; // YaRN original context length
float defrag_thold = 0.1f; // KV cache defragmentation threshold
// offload params
std::vector<ggml_backend_dev_t> devices; // devices to use for offloading
@@ -258,10 +313,12 @@ struct common_params {
enum llama_rope_scaling_type rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED;
enum llama_pooling_type pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED; // pooling type for embeddings
enum llama_attention_type attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED; // attention type for embeddings
enum llama_flash_attn_type flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO; // whether to use Flash Attention
struct common_params_sampling sampling;
struct common_params_speculative speculative;
struct common_params_vocoder vocoder;
struct common_params_diffusion diffusion;
struct common_params_model model;
@@ -290,6 +347,7 @@ struct common_params {
int32_t verbosity = 0;
int32_t control_vector_layer_start = -1; // layer range for control vector
int32_t control_vector_layer_end = -1; // layer range for control vector
bool offline = false;
int32_t ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
int32_t ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
@@ -319,20 +377,22 @@ struct common_params {
bool multiline_input = false; // reverse the usage of `\`
bool simple_io = false; // improves compatibility with subprocesses and limited consoles
bool cont_batching = true; // insert new sequences for decoding on-the-fly
bool flash_attn = false; // flash attention
bool no_perf = false; // disable performance metrics
bool ctx_shift = true; // context shift on inifinite text generation
bool ctx_shift = false; // context shift on infinite text generation
bool swa_full = false; // use full-size SWA cache (https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
bool kv_unified = false; // enable unified KV cache
bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
bool verbose_prompt = false; // print prompt tokens before generation
bool display_prompt = true; // print prompt before generation
bool dump_kv_cache = false; // dump the KV cache contents for debugging purposes
bool no_kv_offload = false; // disable KV offloading
bool warmup = true; // warmup run
bool check_tensors = false; // validate tensor data
bool no_op_offload = false; // globally disable offload host tensor operations to device
bool no_extra_bufts = false; // disable extra buffer types (used for weight repacking)
bool no_host = false; // bypass host buffer allowing extra buffers to be used
bool single_turn = false; // single turn chat conversation
@@ -347,35 +407,47 @@ struct common_params {
bool no_mmproj = false; // explicitly disable multimodal model
std::vector<std::string> image; // path to image file(s)
// finetune
struct lr_opt lr;
enum ggml_opt_optimizer_type optimizer = GGML_OPT_OPTIMIZER_TYPE_ADAMW;
float val_split = 0.05f; // fraction of the data used for the validation set
// embedding
bool embedding = false; // get only sentence embedding
int32_t embd_normalize = 2; // normalisation for embeddings (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)
std::string embd_out = ""; // empty = default, "array" = [[],[]...], "json" = openai style, "json+" = same "json" + cosine similarity matrix
std::string embd_sep = "\n"; // separator of embeddings
bool reranking = false; // enable reranking support on server
std::string cls_sep = "\t"; // separator of classification sequences
// server params
int32_t port = 8080; // server listens on this network port
int32_t timeout_read = 600; // http read timeout in seconds
int32_t timeout_write = timeout_read; // http write timeout in seconds
int32_t n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)
int32_t n_cache_reuse = 0; // min chunk size to reuse from the cache via KV shifting
int32_t port = 8080; // server listens on this network port
int32_t timeout_read = 600; // http read timeout in seconds
int32_t timeout_write = timeout_read; // http write timeout in seconds
int32_t n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)
int32_t n_cache_reuse = 0; // min chunk size to reuse from the cache via KV shifting
int32_t n_ctx_checkpoints = 8; // max number of context checkpoints per slot
int32_t cache_ram_mib = 8192; // -1 = no limit, 0 - disable, 1 = 1 MiB, etc.
std::string hostname = "127.0.0.1";
std::string public_path = ""; // NOLINT
std::string api_prefix = ""; // NOLINT
std::string chat_template = ""; // NOLINT
bool use_jinja = false; // NOLINT
bool enable_chat_template = true;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int reasoning_budget = -1;
bool prefill_assistant = true; // if true, any trailing assistant message will be prefilled into the response
std::vector<std::string> api_keys;
std::string ssl_file_key = ""; // NOLINT
std::string ssl_file_cert = ""; // NOLINT
std::map<std::string, std::string> default_template_kwargs;
// "advanced" endpoints are disabled by default for better security
bool webui = true;
bool endpoint_slots = false;
bool endpoint_slots = true;
bool endpoint_props = false; // only control POST requests, not GET
bool endpoint_metrics = false;
@@ -383,7 +455,7 @@ struct common_params {
std::string slot_save_path;
float slot_prompt_similarity = 0.5f;
float slot_prompt_similarity = 0.1f;
// batched-bench params
bool is_pp_shared = false;
@@ -407,10 +479,12 @@ struct common_params {
int32_t n_out_freq = 10; // output the imatrix every n_out_freq iterations
int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
int32_t i_chunk = 0; // start processing from this chunk
int8_t imat_dat = 0; // whether the legacy imatrix.dat format should be output (gguf <= 0 < dat)
bool process_output = false; // collect data for the output tensor
bool compute_ppl = true; // whether to compute perplexity
bool parse_special = false; // whether to parse special tokens during imatrix tokenization
bool process_output = false; // collect data for the output tensor
bool compute_ppl = true; // whether to compute perplexity
bool show_statistics = false; // show imatrix statistics per tensor
bool parse_special = false; // whether to parse special tokens during imatrix tokenization
// cvector-generator params
int n_pca_batch = 100;
@@ -426,6 +500,11 @@ struct common_params {
// common params
std::string out_file; // output filename for all example programs
// optional callback for model loading progress and cancellation:
// called with a progress value between 0.0 and 1.0.
// return false from callback to abort model loading or true to continue
llama_progress_callback load_progress_callback = NULL;
void * load_progress_callback_user_data = NULL;
};
// call once at the start of a program if it uses libcommon
@@ -503,10 +582,10 @@ static bool string_starts_with(const std::string & str,
return str.rfind(prefix, 0) == 0;
}
static bool string_ends_with(const std::string & str,
const std::string & suffix) { // While we wait for C++20's std::string::ends_with...
return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
}
// While we wait for C++20's std::string::ends_with...
bool string_ends_with(const std::string_view & str, const std::string_view & suffix);
bool string_remove_suffix(std::string & str, const std::string_view & suffix);
size_t string_find_partial_stop(const std::string_view & str, const std::string_view & stop);
bool string_parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
void string_process_escapes(std::string & input);
@@ -615,16 +694,6 @@ std::string common_detokenize(
const std::vector<llama_token> & tokens,
bool special = true);
//
// KV cache utils
//
// Dump the KV cache view with the number of sequences per cell.
void common_kv_cache_dump_view(const llama_kv_cache_view & view, int row_size = 80);
// Dump the KV cache view showing individual sequences in each cell (long output).
void common_kv_cache_dump_view_seqs(const llama_kv_cache_view & view, int row_size = 40);
//
// Embedding utils
//
@@ -667,8 +736,25 @@ const char * const LLM_KV_SPLIT_TENSORS_COUNT = "split.tensors.count";
}
//
// MoE utils
//
const char * const LLM_FFN_EXPS_REGEX = "\\.ffn_(up|down|gate)_(ch|)exps";
static std::string llm_ffn_exps_block_regex(int idx) {
return string_format("blk\\.%d%s", idx, LLM_FFN_EXPS_REGEX);
}
static llama_model_tensor_buft_override llm_ffn_exps_cpu_override() {
return { LLM_FFN_EXPS_REGEX, ggml_backend_cpu_buffer_type() };
}
//
// training utils
//
ggml_opt_dataset_t common_opt_dataset_init(struct llama_context * ctx, const std::vector<llama_token> & tokens, int64_t stride);
// "adamw" or "sgd" (case insensitive)
enum ggml_opt_optimizer_type common_opt_get_optimizer(const char *);
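Review note: a small sketch of what the new MoE patterns match. It rebuilds the per-block pattern the same way `llm_ffn_exps_block_regex(idx)` does, but with `std::to_string` in place of `string_format` so it compiles on its own. The helper `matches_tensor`, the use of `std::regex_search`, and the tensor names in the usage note are assumptions made for illustration; this diff does not show how the overrides are applied.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Same pattern text as LLM_FFN_EXPS_REGEX in the header above.
static const char * const ffn_exps_regex = "\\.ffn_(up|down|gate)_(ch|)exps";

// Per-block pattern, e.g. idx = 3 gives: blk\.3\.ffn_(up|down|gate)_(ch|)exps
static std::string ffn_exps_block_regex(int idx) {
    assert(idx >= 0);
    return "blk\\." + std::to_string(idx) + ffn_exps_regex;
}

// Hypothetical helper: true if the pattern occurs anywhere in the tensor name.
static bool matches_tensor(const std::string & pattern, const std::string & tensor_name) {
    return std::regex_search(tensor_name, std::regex(pattern));
}
```

Under these assumptions, the pattern for block 3 matches names such as `blk.3.ffn_up_exps.weight` and `blk.3.ffn_gate_chexps.weight`, but not `blk.30.ffn_up_exps.weight`, because the escaped dot after the index has to match literally.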

@@ -1,8 +1,9 @@
#include "json-schema-to-grammar.h"
#include "common.h"
#include <nlohmann/json.hpp>
#include <algorithm>
#include <fstream>
#include <map>
#include <regex>
#include <sstream>
@@ -40,49 +41,6 @@ static std::string build_repetition(const std::string & item_rule, int min_items
return result;
}
/* Minimalistic replacement for std::string_view, which is only available from C++17 onwards */
class string_view {
const std::string & _str;
const size_t _start;
const size_t _end;
public:
string_view(const std::string & str, size_t start = 0, size_t end = std::string::npos) : _str(str), _start(start), _end(end == std::string::npos ? str.length() : end) {}
size_t size() const {
return _end - _start;
}
size_t length() const {
return size();
}
operator std::string() const {
return str();
}
std::string str() const {
return _str.substr(_start, _end - _start);
}
string_view substr(size_t pos, size_t len = std::string::npos) const {
return string_view(_str, _start + pos, len == std::string::npos ? _end : _start + pos + len);
}
char operator[](size_t pos) const {
auto index = _start + pos;
if (index >= _end) {
throw std::out_of_range("string_view index out of range");
}
return _str[_start + pos];
}
bool operator==(const string_view & other) const {
std::string this_str = *this;
std::string other_str = other;
return this_str == other_str;
}
};
static void _build_min_max_int(int min_value, int max_value, std::stringstream & out, int decimals_left = 16, bool top_level = true) {
auto has_min = min_value != std::numeric_limits<int>::min();
auto has_max = max_value != std::numeric_limits<int>::max();
@@ -111,14 +69,14 @@ static void _build_min_max_int(int min_value, int max_value, std::stringstream &
}
out << "}";
};
std::function<void(const string_view &, const string_view &)> uniform_range =
[&](const string_view & from, const string_view & to) {
std::function<void(const std::string_view &, const std::string_view &)> uniform_range =
[&](const std::string_view & from, const std::string_view & to) {
size_t i = 0;
while (i < from.length() && i < to.length() && from[i] == to[i]) {
i++;
}
if (i > 0) {
out << "\"" << from.substr(0, i).str() << "\"";
out << "\"" << from.substr(0, i) << "\"";
}
if (i < from.length() && i < to.length()) {
if (i > 0) {
@@ -299,12 +257,13 @@ std::unordered_map<std::string, BuiltinRule> STRING_FORMAT_RULES = {
};
static bool is_reserved_name(const std::string & name) {
static std::unordered_set<std::string> RESERVED_NAMES;
if (RESERVED_NAMES.empty()) {
RESERVED_NAMES.insert("root");
for (const auto &p : PRIMITIVE_RULES) RESERVED_NAMES.insert(p.first);
for (const auto &p : STRING_FORMAT_RULES) RESERVED_NAMES.insert(p.first);
}
static const std::unordered_set<std::string> RESERVED_NAMES = [] {
std::unordered_set<std::string> s;
s.insert("root");
for (const auto & p : PRIMITIVE_RULES) s.insert(p.first);
for (const auto & p : STRING_FORMAT_RULES) s.insert(p.first);
return s;
}();
return RESERVED_NAMES.find(name) != RESERVED_NAMES.end();
}
@@ -885,9 +844,10 @@ public:
_build_object_rule(
properties, required, name,
schema.contains("additionalProperties") ? schema["additionalProperties"] : json()));
} else if ((schema_type.is_null() || schema_type == "object") && schema.contains("allOf")) {
} else if ((schema_type.is_null() || schema_type == "object" || schema_type == "string") && schema.contains("allOf")) {
std::unordered_set<std::string> required;
std::vector<std::pair<std::string, json>> properties;
std::map<std::string, size_t> enum_values;
std::string hybrid_name = name;
std::function<void(const json &, bool)> add_component = [&](const json & comp_schema, bool is_required) {
if (comp_schema.contains("$ref")) {
@@ -899,6 +859,14 @@ public:
required.insert(prop.key());
}
}
} else if (comp_schema.contains("enum")) {
for (const auto & v : comp_schema["enum"]) {
const auto rule = _generate_constant_rule(v);
if (enum_values.find(rule) == enum_values.end()) {
enum_values[rule] = 0;
}
enum_values[rule] += 1;
}
} else {
// todo warning
}
@@ -912,6 +880,17 @@ public:
add_component(t, true);
}
}
if (!enum_values.empty()) {
std::vector<std::string> enum_intersection;
for (const auto & p : enum_values) {
if (p.second == schema["allOf"].size()) {
enum_intersection.push_back(p.first);
}
}
if (!enum_intersection.empty()) {
return _add_rule(rule_name, "(" + string_join(enum_intersection, " | ") + ") space");
}
}
return _add_rule(rule_name, _build_object_rule(properties, required, hybrid_name, json()));
} else if ((schema_type.is_null() || schema_type == "array") && (schema.contains("items") || schema.contains("prefixItems"))) {
json items = schema.contains("items") ? schema["items"] : schema["prefixItems"];
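Review note: the new `allOf` branch intersects `enum` lists by counting how many components each constant appears in, then keeping the constants whose count equals the number of components. The sketch below shows that counting technique on plain strings. `enum_intersection` is a hypothetical name, and the sketch assumes each value appears at most once per component; a value repeated inside one component would be counted twice.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Count in how many components each value appears, then keep the values
// that are present in every component.
static std::vector<std::string> enum_intersection(
        const std::vector<std::vector<std::string>> & components) {
    assert(!components.empty());
    std::map<std::string, std::size_t> counts;
    for (const auto & comp : components) {
        for (const auto & v : comp) {
            counts[v] += 1; // operator[] value-initializes missing keys to 0
        }
    }
    std::vector<std::string> result;
    for (const auto & p : counts) {
        if (p.second == components.size()) {
            result.push_back(p.first);
        }
    }
    return result; // in key order, because std::map iterates sorted
}
```

For the components `{a, b, c}`, `{b, c, d}` and `{c, b}`, only `b` and `c` reach a count of three, so they form the intersection. Using `std::map` rather than an unordered container keeps the resulting alternatives in a deterministic order.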

@@ -1,9 +1,9 @@
#pragma once
#include "ggml.h"
// Change JSON_ASSERT from assert() to GGML_ASSERT:
#define JSON_ASSERT GGML_ASSERT
#include "json.hpp"
#include <nlohmann/json_fwd.hpp>
#include <functional>
#include <string>
std::string json_schema_to_grammar(const nlohmann::ordered_json & schema,
bool force_gbnf = false);


@@ -4,17 +4,52 @@
#include <condition_variable>
#include <cstdarg>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <sstream>
#include <thread>
#include <vector>
#if defined(_WIN32)
# include <io.h>
# include <windows.h>
# define isatty _isatty
# define fileno _fileno
#else
# include <unistd.h>
#endif // defined(_WIN32)
int common_log_verbosity_thold = LOG_DEFAULT_LLAMA;
void common_log_set_verbosity_thold(int verbosity) {
common_log_verbosity_thold = verbosity;
}
// Auto-detect if colors should be enabled based on terminal and environment
static bool common_log_should_use_colors_auto() {
// Check NO_COLOR environment variable (https://no-color.org/)
if (const char * no_color = std::getenv("NO_COLOR")) {
if (no_color[0] != '\0') {
return false;
}
}
// Check TERM environment variable
if (const char * term = std::getenv("TERM")) {
if (std::strcmp(term, "dumb") == 0) {
return false;
}
}
// Check if stdout and stderr are connected to a terminal
// We check both because log messages can go to either
bool stdout_is_tty = isatty(fileno(stdout));
bool stderr_is_tty = isatty(fileno(stderr));
return stdout_is_tty || stderr_is_tty;
}
static int64_t t_us() {
return std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
}
@@ -353,6 +388,11 @@ struct common_log * common_log_init() {
struct common_log * common_log_main() {
static struct common_log log;
static std::once_flag init_flag;
std::call_once(init_flag, [&]() {
// Set default to auto-detect colors
log.set_colors(common_log_should_use_colors_auto());
});
return &log;
}
@@ -380,8 +420,19 @@ void common_log_set_file(struct common_log * log, const char * file) {
log->set_file(file);
}
void common_log_set_colors(struct common_log * log, bool colors) {
log->set_colors(colors);
void common_log_set_colors(struct common_log * log, log_colors colors) {
if (colors == LOG_COLORS_AUTO) {
log->set_colors(common_log_should_use_colors_auto());
return;
}
if (colors == LOG_COLORS_DISABLED) {
log->set_colors(false);
return;
}
GGML_ASSERT(colors == LOG_COLORS_ENABLED);
log->set_colors(true);
}
void common_log_set_prefix(struct common_log * log, bool prefix) {
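
Note on the color handling above: the change replaces the `bool` argument with a tri-state `log_colors` value, where `LOG_COLORS_AUTO` defers to the `NO_COLOR` and `TERM` environment variables and to whether stdout or stderr is a terminal. The sketch below restates that decision as pure functions so it can be exercised without touching the real environment. The function names `should_use_colors_auto` (with parameters) and `resolve_colors` are hypothetical and do not exist in the diff; the enum is redeclared with the same values only so the sketch compiles on its own.

```cpp
#include <cstring>

// Same values as the log_colors enum in the diff, redeclared for a standalone sketch.
enum log_colors {
    LOG_COLORS_AUTO     = -1,
    LOG_COLORS_DISABLED = 0,
    LOG_COLORS_ENABLED  = 1,
};

// Hypothetical pure variant of the auto-detection: the environment values and
// the terminal check are passed in instead of being read inside the function.
bool should_use_colors_auto(const char * no_color, const char * term, bool any_tty) {
    // NO_COLOR disables colors only when it is set to a non-empty value.
    if (no_color != nullptr && no_color[0] != '\0') {
        return false;
    }
    // A "dumb" terminal is assumed not to understand color escape sequences.
    if (term != nullptr && std::strcmp(term, "dumb") == 0) {
        return false;
    }
    // Otherwise colors are used only when output goes to a terminal.
    return any_tty;
}

// Hypothetical helper: maps the tri-state setting to the final on/off decision.
bool resolve_colors(log_colors mode, const char * no_color, const char * term, bool any_tty) {
    switch (mode) {
        case LOG_COLORS_DISABLED: return false;
        case LOG_COLORS_ENABLED:  return true;
        case LOG_COLORS_AUTO:     break;
    }
    return should_use_colors_auto(no_color, term, any_tty);
}
```

In a real program the inputs would presumably come from `std::getenv("NO_COLOR")`, `std::getenv("TERM")`, and `isatty(fileno(stdout)) || isatty(fileno(stderr))`, as in the hunk. Note that an explicit `LOG_COLORS_ENABLED` overrides `NO_COLOR` in this reading of the change.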


@@ -24,6 +24,12 @@
#define LOG_DEFAULT_DEBUG 1
#define LOG_DEFAULT_LLAMA 0
enum log_colors {
LOG_COLORS_AUTO = -1,
LOG_COLORS_DISABLED = 0,
LOG_COLORS_ENABLED = 1,
};
// needed by the LOG_TMPL macro to avoid computing log arguments if the verbosity is lower
// set via common_log_set_verbosity_thold()
extern int common_log_verbosity_thold;
@@ -65,10 +71,10 @@ void common_log_add(struct common_log * log, enum ggml_log_level level, const ch
// D - debug (stderr, V = LOG_DEFAULT_DEBUG)
//
void common_log_set_file (struct common_log * log, const char * file); // not thread-safe
void common_log_set_colors (struct common_log * log, bool colors); // not thread-safe
void common_log_set_prefix (struct common_log * log, bool prefix); // whether to output prefix to each log
void common_log_set_timestamps(struct common_log * log, bool timestamps); // whether to output timestamps in the prefix
void common_log_set_file (struct common_log * log, const char * file); // not thread-safe
void common_log_set_colors (struct common_log * log, log_colors colors); // not thread-safe
void common_log_set_prefix (struct common_log * log, bool prefix); // whether to output prefix to each log
void common_log_set_timestamps(struct common_log * log, bool timestamps); // whether to output timestamps in the prefix
// helper macros for logging
// use these to avoid computing log arguments if the verbosity of the log is higher than the threshold


@@ -161,7 +161,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
GGML_ABORT("llguidance (cmake -DLLAMA_LLGUIDANCE=ON) is not enabled");
#endif // LLAMA_USE_LLGUIDANCE
} else {
std::vector<std::string> patterns_at_start;
std::vector<std::string> trigger_patterns;
std::vector<std::string> patterns_anywhere;
std::vector<llama_token> trigger_tokens;
for (const auto & trigger : params.grammar_triggers) {
@@ -173,10 +173,13 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
break;
}
case COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN:
case COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_START:
{
const auto & pattern = trigger.value;
(trigger.type == COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_START ? patterns_at_start : patterns_anywhere).push_back(pattern);
patterns_anywhere.push_back(trigger.value);
break;
}
case COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL:
{
trigger_patterns.push_back(trigger.value);
break;
}
case COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN:
@@ -190,10 +193,6 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
}
}
std::vector<std::string> trigger_patterns;
if (!patterns_at_start.empty()) {
trigger_patterns.push_back("^(" + string_join(patterns_at_start, "|") + ")[\\s\\S]*");
}
if (!patterns_anywhere.empty()) {
trigger_patterns.push_back("^[\\s\\S]*?(" + string_join(patterns_anywhere, "|") + ")[\\s\\S]*");
}
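
Note on the trigger patterns above: the "anywhere" patterns are joined into one alternation and wrapped as `^[\s\S]*?( ... )[\s\S]*`, so the whole text must match while the lazy prefix lets the capture group land on the earliest trigger. The sketch below shows one way such a pattern could be evaluated with `std::regex`. It is an illustration under assumptions, not the grammar sampler's actual matching code; `build_anywhere_pattern` and `find_trigger` are hypothetical names, and the trigger strings in the usage note are made up.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: builds the "match anywhere" pattern in the same shape
// as the hunk, with the individual patterns joined by '|'.
std::string build_anywhere_pattern(const std::vector<std::string> & patterns) {
    std::string joined;
    for (size_t i = 0; i < patterns.size(); ++i) {
        if (i > 0) {
            joined += "|";
        }
        joined += patterns[i];
    }
    return "^[\\s\\S]*?(" + joined + ")[\\s\\S]*";
}

// Hypothetical helper: returns the offset of the first trigger in `text`,
// or -1 when no pattern matches.
long find_trigger(const std::string & text, const std::vector<std::string> & patterns) {
    if (patterns.empty()) {
        return -1;
    }
    const std::regex re(build_anywhere_pattern(patterns));
    std::smatch m;
    // The pattern is anchored and ends in [\s\S]*, so a full match is required.
    if (!std::regex_match(text, m, re)) {
        return -1;
    }
    // Group 1 is the alternation; the lazy prefix makes it the earliest hit.
    return static_cast<long>(m.position(1));
}
```

For example, with the made-up triggers `<tool_call>` and `\[TOOL_CALLS\]`, the text `hello <tool_call>{` yields the offset of `<tool_call>`.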
@@ -333,6 +332,7 @@ void common_perf_print(const struct llama_context * ctx, const struct common_sam
}
if (ctx) {
llama_perf_context_print(ctx);
llama_memory_breakdown_print(ctx);
}
}
@@ -427,8 +427,29 @@ uint32_t common_sampler_get_seed(const struct common_sampler * gsmpl) {
// helpers
llama_token_data_array * common_sampler_get_candidates(struct common_sampler * gsmpl) {
return &gsmpl->cur_p;
llama_token_data_array * common_sampler_get_candidates(struct common_sampler * gsmpl, bool do_sort) {
auto * res = &gsmpl->cur_p;
if (do_sort && !res->sorted) {
// remember the selected token before sorting
const llama_token id = res->data[res->selected].id;
std::sort(res->data, res->data + res->size, [](const llama_token_data & a, const llama_token_data & b) {
return a.p > b.p;
});
// restore the selected token after sorting
for (size_t i = 0; i < res->size; ++i) {
if (res->data[i].id == id) {
res->selected = i;
break;
}
}
res->sorted = true;
}
return res;
}
llama_token common_sampler_last(const struct common_sampler * gsmpl) {


@@ -86,7 +86,9 @@ uint32_t common_sampler_get_seed(const struct common_sampler * gsmpl);
// helpers
// access the internal list of current candidate tokens
llama_token_data_array * common_sampler_get_candidates(struct common_sampler * gsmpl);
// if do_sort == true, the candidates are guaranteed to be sorted afterwards (in descending order of probability)
// the .sorted flag of the result indicates whether the returned candidates are sorted
llama_token_data_array * common_sampler_get_candidates(struct common_sampler * gsmpl, bool do_sort);
// get the last accepted token
llama_token common_sampler_last(const struct common_sampler * gsmpl);
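
Note on `do_sort`: when sorting is requested, the new `common_sampler_get_candidates` sorts the candidates by descending probability and then re-locates the previously selected token so that `selected` still points at the same token id. The sketch below reproduces that pattern on stand-in types. `token_data`, `token_data_array`, and `sort_candidates` are hypothetical names that only mirror the fields used in the hunk; the bounds check on `selected` is an added assumption that is not present in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Stand-in types that mirror only the fields used in the hunk.
struct token_data {
    int32_t id;
    float   p;
};

struct token_data_array {
    token_data * data;
    size_t       size;
    int64_t      selected; // index into data; negative is treated here as "none"
    bool         sorted;
};

// Sorts candidates by descending probability and keeps `selected` pointing at
// the same token id it referred to before the sort.
void sort_candidates(token_data_array & arr) {
    if (arr.sorted) {
        return;
    }
    // Added assumption: guard against an unset or out-of-range selection.
    const bool has_selected = arr.selected >= 0 && static_cast<size_t>(arr.selected) < arr.size;
    const int32_t id = has_selected ? arr.data[arr.selected].id : -1;

    std::sort(arr.data, arr.data + arr.size, [](const token_data & a, const token_data & b) {
        return a.p > b.p;
    });

    if (has_selected) {
        // Restore the selection by searching for the remembered token id.
        for (size_t i = 0; i < arr.size; ++i) {
            if (arr.data[i].id == id) {
                arr.selected = static_cast<int64_t>(i);
                break;
            }
        }
    }
    arr.sorted = true;
}
```

Searching by id after the sort costs one linear pass, which is small next to the sort itself.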

Some files were not shown because too many files have changed in this diff