mirror of https://github.com/ollama/ollama.git synced 2026-02-06 13:43:39 -05:00

Files

Jesse Gross c61023f554 ollamarunner: Fix off by one error with numPredict

When numPredict is set, the user will receive one less token
than the requested limit. In addition, the stats will incorrectly
show the number of tokens returned as the limit. In cases where
numPredict is not set, the number of tokens is reported correctly.

This occurs because numPredict is checked when setting up the next
batch but hitting the limit will terminate the current batch as well.
Instead, it is better to check the limit as we actually predict tokens.

2026-02-04 17:14:24 -08:00
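
The off-by-one described above can be illustrated with a toy model. This is only a sketch and not the ollamarunner code: the `result` type, both function names, and the "pending token" mechanism are assumptions made for illustration. It assumes a sampled token is handed to the caller only when the next batch is set up, so ending the sequence during setup drops the last token while the counter already shows the limit. Only a positive `numPredict` is modelled.

```go
package main

import "fmt"

// result is a hypothetical summary of one generation run.
type result struct {
	delivered int // tokens the caller actually received
	reported  int // token count shown in the stats
}

// checkAtBatchSetup models the behaviour described in the commit message:
// the limit is tested while preparing the next batch, and hitting it also
// ends the batch that still holds the most recent token.
func checkAtBatchSetup(numPredict int) result {
	var r result
	if numPredict <= 0 {
		return r // the unlimited case is not modelled in this sketch
	}
	pending := false // a sampled token waiting to be delivered
	for {
		if r.reported >= numPredict {
			return r // limit hit during setup: the pending token is dropped
		}
		if pending {
			r.delivered++ // hand over the token sampled in the previous step
		}
		pending = true // sample the next token
		r.reported++
	}
}

// checkOnPredict tests the limit right after each token is predicted,
// so every counted token is also delivered.
func checkOnPredict(numPredict int) result {
	var r result
	if numPredict <= 0 {
		return r
	}
	for {
		r.reported++  // sample a token
		r.delivered++ // deliver it immediately
		if r.reported >= numPredict {
			return r
		}
	}
}

func main() {
	a := checkAtBatchSetup(3)
	b := checkOnPredict(3)
	fmt.Printf("setup check:   delivered=%d reported=%d\n", a.delivered, a.reported)
	fmt.Printf("predict check: delivered=%d reported=%d\n", b.delivered, b.reported)
}
```

With a limit of 3, the first variant delivers 2 tokens while reporting 3, and the second delivers and reports 3.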

common

server: add logprobs and top_logprobs support to Ollama's API (#12899)

2025-11-11 08:49:50 -08:00

llamarunner

flash attn: add auto mode for llama engine (#13052)

2025-12-12 13:27:19 -08:00

ollamarunner

ollamarunner: Fix off by one error with numPredict

2026-02-04 17:14:24 -08:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

glm 4.7 flash support on experimental engine (#13838)

2026-02-02 15:22:11 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
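
The same requests can be built from Go. The following is a minimal sketch, assuming the runner listens on `localhost:8080` as in the curl examples; `promptRequest` and `newRunnerRequest` are hypothetical names introduced here, not part of the runner. The response format is not documented in this README, so the sketch only constructs the request and does not decode a reply.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// promptRequest mirrors the JSON body used in the curl examples.
type promptRequest struct {
	Prompt string `json:"prompt"`
}

// newRunnerRequest is a hypothetical helper that builds (but does not send)
// a POST request for the given endpoint, e.g. "/completion" or "/embedding".
func newRunnerRequest(baseURL, endpoint, prompt string) (*http.Request, error) {
	body, err := json.Marshal(promptRequest{Prompt: prompt})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newRunnerRequest("http://localhost:8080", "/completion", "hi")
	if err != nil {
		panic(err)
	}
	// Pass req to http.DefaultClient.Do to send it to a running runner.
	fmt.Println(req.Method, req.URL.String()) // prints "POST http://localhost:8080/completion"
}
```

Using `"/embedding"` as the endpoint with a different prompt produces the equivalent of the second curl command.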