remove .First

Merge pull request #1132 from jmorganca/mxyng/human-bytes
replace go-humanize with format.HumanBytes
2026-02-07 14:13:29 -05:00 · 2023-11-15 18:07:13 -05:00 · 2023-11-15 09:46:21 -08:00 · 2023-11-15 12:32:37 -05:00 · 2023-11-14 14:57:41 -08:00 · 2023-11-14 16:42:21 -05:00
60 changed files with 2721 additions and 2704 deletions
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ curl https://ollama.ai/install.sh | sh

 ### Docker

-See the official [Docker image](https://hub.docker.com/r/ollama/ollama).
+The official [Ollama Docker image](https://hub.docker.com/r/ollama/ollama) `ollama/ollama` is available on Docker Hub.

 ## Quickstart

@@ -88,7 +88,7 @@ See the [guide](docs/import.md) on importing models for more information.

 ### Customize a prompt

-Models from the Ollama library can be customized with a prompt. The example
+Models from the Ollama library can be customized with a prompt. For example, to customize the `llama2` model:

 ```
 ollama pull llama2
@@ -159,7 +159,7 @@ I'm a basic program that prints the famous "Hello, world!" message to the consol
 ### Pass in prompt as arguments

 ```
-$ ollama run llama2 "summarize this file:" "$(cat README.md)"
+$ ollama run llama2 "Summarize this file: $(cat README.md)"
 Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
 ```

@@ -178,8 +178,7 @@ ollama list
 Install `cmake` and `go`:

 ```
-brew install cmake
-brew install go
+brew install cmake go
 ```

 Then generate dependencies and build:
@@ -203,9 +202,8 @@ Finally, in a separate shell, run a model:

 ## REST API

-See the [API documentation](docs/api.md) for all endpoints.
-
-Ollama has an API for running and managing models. For example to generate text from a model:
+Ollama has a REST API for running and managing models.
+For example, to generate text from a model:

 ```
 curl -X POST http://localhost:11434/api/generate -d '{
@@ -214,22 +212,48 @@ curl -X POST http://localhost:11434/api/generate -d '{
 }'
 ```

+See the [API documentation](./docs/api.md) for all endpoints.
+
 ## Community Integrations

+### Web & Desktop
+
+- [HTML UI](https://github.com/rtcfirefly/ollama-ui)
+- [Chatbot UI](https://github.com/ivanfioravanti/chatbot-ollama)
+- [Typescript UI](https://github.com/ollama-interface/Ollama-Gui?tab=readme-ov-file)
+- [Minimalistic React UI for Ollama Models](https://github.com/richawo/minimal-llm-ui)
+- [Web UI](https://github.com/ollama-webui/ollama-webui)
+- [Ollamac](https://github.com/kevinhermawan/Ollamac)
+- [big-AGI](https://github.com/enricoros/big-agi/blob/main/docs/config-ollama.md)
+
+### Terminal
+
+- [oterm](https://github.com/ggozad/oterm)
+- [Ellama Emacs client](https://github.com/s-kostyaev/ellama)
+- [Emacs client](https://github.com/zweifisch/ollama)
+- [gen.nvim](https://github.com/David-Kunz/gen.nvim)
+- [ollama.nvim](https://github.com/nomnivore/ollama.nvim)
+- [gptel Emacs client](https://github.com/karthink/gptel)
+
+### Libraries
+
 - [LangChain](https://python.langchain.com/docs/integrations/llms/ollama) and [LangChain.js](https://js.langchain.com/docs/modules/model_io/models/llms/integrations/ollama) with [example](https://js.langchain.com/docs/use_cases/question_answering/local_retrieval_qa)
 - [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/llm/ollama.html)
+- [LiteLLM](https://github.com/BerriAI/litellm)
+- [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)
+- [Ollama-rs for Rust](https://github.com/pepperoni21/ollama-rs)
+- [Ollama4j for Java](https://github.com/amithkoujalgi/ollama4j)
+- [ModelFusion Typescript Library](https://modelfusion.dev/integration/model-provider/ollama)
+- [OllamaKit for Swift](https://github.com/kevinhermawan/OllamaKit)
+- [Ollama for Dart](https://github.com/breitburg/dart-ollama)
+
+### Extensions & Plugins
+
 - [Raycast extension](https://github.com/MassimilianoPasquini97/raycast_ollama)
 - [Discollama](https://github.com/mxyng/discollama) (Discord bot inside the Ollama discord channel)
 - [Continue](https://github.com/continuedev/continue)
 - [Obsidian Ollama plugin](https://github.com/hinterdupfinger/obsidian-ollama)
+- [Logseq Ollama plugin](https://github.com/omagdy7/ollama-logseq)
 - [Dagger Chatbot](https://github.com/samalba/dagger-chatbot)
- [LiteLLM](https://github.com/BerriAI/litellm)
 - [Discord AI Bot](https://github.com/mekb-turtle/discord-ai-bot)
- [Chatbot UI](https://github.com/ivanfioravanti/chatbot-ollama)
- [HTML UI](https://github.com/rtcfirefly/ollama-ui)
- [Typescript UI](https://github.com/ollama-interface/Ollama-Gui?tab=readme-ov-file)
- [Dumbar](https://github.com/JerrySievert/Dumbar)
- [Emacs client](https://github.com/zweifisch/ollama)
- [oterm](https://github.com/ggozad/oterm)
- [Ellama Emacs client](https://github.com/s-kostyaev/ellama)
- [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)
+- [Hass Ollama Conversation](https://github.com/ej52/hass-ollama-conversation)
--- a/api/client.go
+++ b/api/client.go
@@ -18,10 +18,6 @@ import (
 	"github.com/jmorganca/ollama/version"
 )

-const DefaultHost = "127.0.0.1:11434"
-
-var envHost = os.Getenv("OLLAMA_HOST")
-
 type Client struct {
 	base *url.URL
 	http http.Client
@@ -44,14 +40,24 @@ func checkError(resp *http.Response, body []byte) error {
 }

 func ClientFromEnvironment() (*Client, error) {
+	defaultPort := "11434"
+
 	scheme, hostport, ok := strings.Cut(os.Getenv("OLLAMA_HOST"), "://")
-	if !ok {
+	switch {
+	case !ok:
 		scheme, hostport = "http", os.Getenv("OLLAMA_HOST")
+	case scheme == "http":
+		defaultPort = "80"
+	case scheme == "https":
+		defaultPort = "443"
 	}

+	// trim trailing slashes
+	hostport = strings.TrimRight(hostport, "/")
+
 	host, port, err := net.SplitHostPort(hostport)
 	if err != nil {
-		host, port = "127.0.0.1", "11434"
+		host, port = "127.0.0.1", defaultPort
 		if ip := net.ParseIP(strings.Trim(hostport, "[]")); ip != nil {
 			host = ip.String()
 		} else if hostport != "" {
@@ -66,7 +72,7 @@ func ClientFromEnvironment() (*Client, error) {
 		},
 	}

-	mockRequest, err := http.NewRequest("HEAD", client.base.String(), nil)
+	mockRequest, err := http.NewRequest(http.MethodHead, client.base.String(), nil)
 	if err != nil {
 		return nil, err
 	}
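
The port-defaulting rule introduced in `ClientFromEnvironment` can be sketched as a standalone function. This is a minimal sketch, not the project's code: `resolveHost` is a hypothetical name, and both the body of the `else if hostport != ""` branch and the final URL construction are assumptions, because the hunk above elides them.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// resolveHost is a hypothetical helper that mirrors the rule in the hunk:
// no scheme means http on port 11434, while an explicit http or https
// scheme falls back to port 80 or 443 when no port is given.
func resolveHost(value string) string {
	defaultPort := "11434"

	scheme, hostport, ok := strings.Cut(value, "://")
	switch {
	case !ok:
		scheme, hostport = "http", value
	case scheme == "http":
		defaultPort = "80"
	case scheme == "https":
		defaultPort = "443"
	}

	// trim trailing slashes so "example.com/" parses like "example.com"
	hostport = strings.TrimRight(hostport, "/")

	host, port, err := net.SplitHostPort(hostport)
	if err != nil {
		// no port present: fall back to the defaults
		host, port = "127.0.0.1", defaultPort
		if ip := net.ParseIP(strings.Trim(hostport, "[]")); ip != nil {
			host = ip.String()
		} else if hostport != "" {
			// assumption: a bare hostname is used as-is
			host = hostport
		}
	}

	// assumption: the base URL is built from scheme, host and port
	u := url.URL{Scheme: scheme, Host: net.JoinHostPort(host, port)}
	return u.String()
}

func main() {
	for _, v := range []string{"", "1.2.3.4", "https://example.com", "example.com:1234/"} {
		fmt.Printf("%q -> %s\n", v, resolveHost(v))
	}
}
```

The inputs in `main` are taken from the table in `api/client_test.go`, so the sketch can be compared against the expectations listed there.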
--- a/api/client.py
+++ b/api/client.py
@@ -7,7 +7,7 @@ BASE_URL = os.environ.get('OLLAMA_HOST', 'http://localhost:11434')
 # Generate a response for a given prompt with a provided model. This is a streaming endpoint, so will be a series of responses.
 # The final response object will include statistics and additional data from the request. Use the callback function to override
 # the default handler.
-def generate(model_name, prompt, system=None, template=None, context=None, options=None, callback=None):
+def generate(model_name, prompt, system=None, template=None, format="", context=None, options=None, callback=None):
    try:
        url = f"{BASE_URL}/api/generate"
        payload = {
@@ -16,7 +16,8 @@ def generate(model_name, prompt, system=None, template=None, context=None, optio
            "system": system, 
            "template": template, 
            "context": context, 
-            "options": options
+            "options": options,
+            "format": format,
        }
        
        # Remove keys with None values
--- a/api/client_test.go
+++ b/api/client_test.go
@@ -0,0 +1,43 @@
+package api
+
+import "testing"
+
+func TestClientFromEnvironment(t *testing.T) {
+	type testCase struct {
+		value  string
+		expect string
+		err    error
+	}
+
+	testCases := map[string]*testCase{
+		"empty":                      {value: "", expect: "http://127.0.0.1:11434"},
+		"only address":               {value: "1.2.3.4", expect: "http://1.2.3.4:11434"},
+		"only port":                  {value: ":1234", expect: "http://:1234"},
+		"address and port":           {value: "1.2.3.4:1234", expect: "http://1.2.3.4:1234"},
+		"scheme http and address":    {value: "http://1.2.3.4", expect: "http://1.2.3.4:80"},
+		"scheme https and address":   {value: "https://1.2.3.4", expect: "https://1.2.3.4:443"},
+		"scheme, address, and port":  {value: "https://1.2.3.4:1234", expect: "https://1.2.3.4:1234"},
+		"hostname":                   {value: "example.com", expect: "http://example.com:11434"},
+		"hostname and port":          {value: "example.com:1234", expect: "http://example.com:1234"},
+		"scheme http and hostname":   {value: "http://example.com", expect: "http://example.com:80"},
+		"scheme https and hostname":  {value: "https://example.com", expect: "https://example.com:443"},
+		"scheme, hostname, and port": {value: "https://example.com:1234", expect: "https://example.com:1234"},
+		"trailing slash":             {value: "example.com/", expect: "http://example.com:11434"},
+		"trailing slash port":        {value: "example.com:1234/", expect: "http://example.com:1234"},
+	}
+
+	for k, v := range testCases {
+		t.Run(k, func(t *testing.T) {
+			t.Setenv("OLLAMA_HOST", v.value)
+
+			client, err := ClientFromEnvironment()
+			if err != v.err {
+				t.Fatalf("expected %v, got %v", v.err, err)
+			}
+
+			if client.base.String() != v.expect {
+				t.Fatalf("expected %s, got %s", v.expect, client.base.String())
+			}
+		})
+	}
+}
--- a/api/types.go
+++ b/api/types.go
@@ -37,10 +37,56 @@ type GenerateRequest struct {
 	Template string `json:"template"`
 	Context  []int  `json:"context,omitempty"`
 	Stream   *bool  `json:"stream,omitempty"`
+	Raw      bool   `json:"raw,omitempty"`
+	Format   string `json:"format"`

 	Options map[string]interface{} `json:"options"`
 }

+// Options specified in GenerateRequest; if you add a new option here, add it to the API docs also
+type Options struct {
+	Runner
+
+	// Predict options used at runtime
+	NumKeep          int      `json:"num_keep,omitempty"`
+	Seed             int      `json:"seed,omitempty"`
+	NumPredict       int      `json:"num_predict,omitempty"`
+	TopK             int      `json:"top_k,omitempty"`
+	TopP             float32  `json:"top_p,omitempty"`
+	TFSZ             float32  `json:"tfs_z,omitempty"`
+	TypicalP         float32  `json:"typical_p,omitempty"`
+	RepeatLastN      int      `json:"repeat_last_n,omitempty"`
+	Temperature      float32  `json:"temperature,omitempty"`
+	RepeatPenalty    float32  `json:"repeat_penalty,omitempty"`
+	PresencePenalty  float32  `json:"presence_penalty,omitempty"`
+	FrequencyPenalty float32  `json:"frequency_penalty,omitempty"`
+	Mirostat         int      `json:"mirostat,omitempty"`
+	MirostatTau      float32  `json:"mirostat_tau,omitempty"`
+	MirostatEta      float32  `json:"mirostat_eta,omitempty"`
+	PenalizeNewline  bool     `json:"penalize_newline,omitempty"`
+	Stop             []string `json:"stop,omitempty"`
+}
+
+// Runner options which must be set when the model is loaded into memory
+type Runner struct {
+	UseNUMA            bool    `json:"numa,omitempty"`
+	NumCtx             int     `json:"num_ctx,omitempty"`
+	NumBatch           int     `json:"num_batch,omitempty"`
+	NumGQA             int     `json:"num_gqa,omitempty"`
+	NumGPU             int     `json:"num_gpu,omitempty"`
+	MainGPU            int     `json:"main_gpu,omitempty"`
+	LowVRAM            bool    `json:"low_vram,omitempty"`
+	F16KV              bool    `json:"f16_kv,omitempty"`
+	LogitsAll          bool    `json:"logits_all,omitempty"`
+	VocabOnly          bool    `json:"vocab_only,omitempty"`
+	UseMMap            bool    `json:"use_mmap,omitempty"`
+	UseMLock           bool    `json:"use_mlock,omitempty"`
+	EmbeddingOnly      bool    `json:"embedding_only,omitempty"`
+	RopeFrequencyBase  float32 `json:"rope_frequency_base,omitempty"`
+	RopeFrequencyScale float32 `json:"rope_frequency_scale,omitempty"`
+	NumThread          int     `json:"num_thread,omitempty"`
+}
+
 type EmbeddingRequest struct {
 	Model  string `json:"model"`
 	Prompt string `json:"prompt"`
@@ -161,49 +207,6 @@ func (r *GenerateResponse) Summary() {
 	}
 }

-// Runner options which must be set when the model is loaded into memory
-type Runner struct {
-	UseNUMA            bool    `json:"numa,omitempty"`
-	NumCtx             int     `json:"num_ctx,omitempty"`
-	NumBatch           int     `json:"num_batch,omitempty"`
-	NumGQA             int     `json:"num_gqa,omitempty"`
-	NumGPU             int     `json:"num_gpu,omitempty"`
-	MainGPU            int     `json:"main_gpu,omitempty"`
-	LowVRAM            bool    `json:"low_vram,omitempty"`
-	F16KV              bool    `json:"f16_kv,omitempty"`
-	LogitsAll          bool    `json:"logits_all,omitempty"`
-	VocabOnly          bool    `json:"vocab_only,omitempty"`
-	UseMMap            bool    `json:"use_mmap,omitempty"`
-	UseMLock           bool    `json:"use_mlock,omitempty"`
-	EmbeddingOnly      bool    `json:"embedding_only,omitempty"`
-	RopeFrequencyBase  float32 `json:"rope_frequency_base,omitempty"`
-	RopeFrequencyScale float32 `json:"rope_frequency_scale,omitempty"`
-	NumThread          int     `json:"num_thread,omitempty"`
-}
-
-type Options struct {
-	Runner
-
-	// Predict options used at runtime
-	NumKeep          int      `json:"num_keep,omitempty"`
-	Seed             int      `json:"seed,omitempty"`
-	NumPredict       int      `json:"num_predict,omitempty"`
-	TopK             int      `json:"top_k,omitempty"`
-	TopP             float32  `json:"top_p,omitempty"`
-	TFSZ             float32  `json:"tfs_z,omitempty"`
-	TypicalP         float32  `json:"typical_p,omitempty"`
-	RepeatLastN      int      `json:"repeat_last_n,omitempty"`
-	Temperature      float32  `json:"temperature,omitempty"`
-	RepeatPenalty    float32  `json:"repeat_penalty,omitempty"`
-	PresencePenalty  float32  `json:"presence_penalty,omitempty"`
-	FrequencyPenalty float32  `json:"frequency_penalty,omitempty"`
-	Mirostat         int      `json:"mirostat,omitempty"`
-	MirostatTau      float32  `json:"mirostat_tau,omitempty"`
-	MirostatEta      float32  `json:"mirostat_eta,omitempty"`
-	PenalizeNewline  bool     `json:"penalize_newline,omitempty"`
-	Stop             []string `json:"stop,omitempty"`
-}
-
 var ErrInvalidOpts = fmt.Errorf("invalid options")

 func (opts *Options) FromMap(m map[string]interface{}) error {
@@ -293,7 +296,7 @@ func DefaultOptions() Options {
 	return Options{
 		// options set on request to runner
 		NumPredict:       -1,
-		NumKeep:          -1,
+		NumKeep:          0,
 		Temperature:      0.8,
 		TopK:             40,
 		TopP:             0.9,
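
Because `Options` embeds `Runner`, `encoding/json` promotes the runner fields into the same JSON object as the predict options, and `omitempty` drops zero values such as the new `NumKeep` default of `0`. The following is a minimal sketch with trimmed-down copies of the two structs; only three fields are kept, and `encode` is a hypothetical helper that is not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Runner is a trimmed-down copy of the struct in the diff (one field only).
type Runner struct {
	NumCtx int `json:"num_ctx,omitempty"`
}

// Options is a trimmed-down copy that embeds Runner, as in the diff.
type Options struct {
	Runner

	NumKeep     int     `json:"num_keep,omitempty"`
	Temperature float32 `json:"temperature,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of opts.
func encode(opts Options) string {
	b, err := json.Marshal(opts)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// NumKeep is left at zero, so omitempty leaves it out of the output.
	opts := Options{Runner: Runner{NumCtx: 2048}, Temperature: 0.8}
	fmt.Println(encode(opts))
}
```

One consequence of `omitempty` worth keeping in mind: an option explicitly set to its zero value is indistinguishable from an option that was never set once the struct is serialized.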
--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -1,7 +1,6 @@
 package cmd

 import (
-	"bufio"
 	"context"
 	"crypto/ed25519"
 	"crypto/rand"
@@ -11,6 +10,7 @@ import (
 	"io"
 	"log"
 	"net"
+	"net/http"
 	"os"
 	"os/exec"
 	"os/signal"
@@ -20,9 +20,7 @@ import (
 	"syscall"
 	"time"

-	"github.com/dustin/go-humanize"
 	"github.com/olekukonko/tablewriter"
-	"github.com/pdevine/readline"
 	"github.com/spf13/cobra"
 	"golang.org/x/crypto/ssh"
 	"golang.org/x/term"
@@ -30,30 +28,11 @@ import (
 	"github.com/jmorganca/ollama/api"
 	"github.com/jmorganca/ollama/format"
 	"github.com/jmorganca/ollama/progressbar"
+	"github.com/jmorganca/ollama/readline"
 	"github.com/jmorganca/ollama/server"
 	"github.com/jmorganca/ollama/version"
 )

-type Painter struct {
-	IsMultiLine bool
-}
-
-func (p Painter) Paint(line []rune, _ int) []rune {
-	termType := os.Getenv("TERM")
-	if termType == "xterm-256color" && len(line) == 0 {
-		var prompt string
-		if p.IsMultiLine {
-			prompt = "Use \"\"\" to end multi-line input"
-		} else {
-			prompt = "Send a message (/? for help)"
-		}
-		return []rune(fmt.Sprintf("\033[38;5;245m%s\033[%dD\033[0m", prompt, len(prompt)))
-	}
-	// add a space and a backspace to prevent the cursor from walking up the screen
-	line = append(line, []rune(" \b")...)
-	return line
-}
-
 func CreateHandler(cmd *cobra.Command, args []string) error {
 	filename, _ := cmd.Flags().GetString("file")
 	filename, err := filepath.Abs(filename)
@@ -118,19 +97,16 @@ func RunHandler(cmd *cobra.Command, args []string) error {
 		return err
 	}

-	models, err := client.List(context.Background())
-	if err != nil {
-		return err
-	}
-
-	canonicalModelPath := server.ParseModelPath(args[0])
-	for _, model := range models.Models {
-		if model.Name == canonicalModelPath.GetShortTagname() {
-			return RunGenerate(cmd, args)
+	name := args[0]
+	// check if the model exists on the server
+	_, err = client.Show(context.Background(), &api.ShowRequest{Name: name})
+	var statusError api.StatusError
+	switch {
+	case errors.As(err, &statusError) && statusError.StatusCode == http.StatusNotFound:
+		if err := PullHandler(cmd, args); err != nil {
+			return err
 		}
-	}
-
-	if err := PullHandler(cmd, args); err != nil {
+	case err != nil:
 		return err
 	}
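
The new existence check relies on `errors.As` to tell a 404 from any other failure. The sketch below isolates that decision. It is an illustration under assumptions: `StatusError` here is a stand-in with only the `StatusCode` field used in the hunk (plus an assumed `Message`), and `shouldPull` is a hypothetical helper, not a function in `cmd`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// StatusError is a stand-in for api.StatusError; only StatusCode is
// taken from the diff, the Message field is an assumption.
type StatusError struct {
	StatusCode int
	Message    string
}

func (e StatusError) Error() string {
	return fmt.Sprintf("%d: %s", e.StatusCode, e.Message)
}

// shouldPull reports whether the model is missing (404) and therefore
// needs to be pulled. Any other error is returned to the caller.
func shouldPull(err error) (bool, error) {
	var statusError StatusError
	switch {
	case errors.As(err, &statusError) && statusError.StatusCode == http.StatusNotFound:
		return true, nil
	case err != nil:
		return false, err
	}
	return false, nil
}

func main() {
	pull, err := shouldPull(StatusError{StatusCode: http.StatusNotFound, Message: "model not found"})
	fmt.Println(pull, err)

	pull, err = shouldPull(errors.New("connection refused"))
	fmt.Println(pull, err)
}
```

Compared with listing every model and comparing names, a single lookup keeps the check to one request and lets the server decide how names are resolved.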

@@ -196,7 +172,7 @@ func ListHandler(cmd *cobra.Command, args []string) error {

 	for _, m := range models.Models {
 		if len(args) == 0 || strings.HasPrefix(m.Name, args[0]) {
-			data = append(data, []string{m.Name, m.Digest[:12], humanize.Bytes(uint64(m.Size)), format.HumanTime(m.ModifiedAt, "Never")})
+			data = append(data, []string{m.Name, m.Digest[:12], format.HumanBytes(m.Size), format.HumanTime(m.ModifiedAt, "Never")})
 		}
 	}

@@ -372,34 +348,49 @@ func pull(model string, insecure bool) error {
 }

 func RunGenerate(cmd *cobra.Command, args []string) error {
-	if len(args) > 1 {
-		// join all args into a single prompt
-		wordWrap := false
-		if term.IsTerminal(int(os.Stdout.Fd())) {
-			wordWrap = true
-		}
+	format, err := cmd.Flags().GetString("format")
+	if err != nil {
+		return err
+	}

-		nowrap, err := cmd.Flags().GetBool("nowordwrap")
+	prompts := args[1:]
+
+	// prepend stdin to the prompt if provided
+	if !term.IsTerminal(int(os.Stdin.Fd())) {
+		in, err := io.ReadAll(os.Stdin)
 		if err != nil {
 			return err
 		}
-		if nowrap {
-			wordWrap = false
-		}

-		return generate(cmd, args[0], strings.Join(args[1:], " "), wordWrap)
+		prompts = append([]string{string(in)}, prompts...)
 	}

-	if readline.IsTerminal(int(os.Stdin.Fd())) {
-		return generateInteractive(cmd, args[0])
+	// output is being piped
+	if !term.IsTerminal(int(os.Stdout.Fd())) {
+		return generate(cmd, args[0], strings.Join(prompts, " "), false, format)
 	}

-	return generateBatch(cmd, args[0])
+	wordWrap := os.Getenv("TERM") == "xterm-256color"
+
+	nowrap, err := cmd.Flags().GetBool("nowordwrap")
+	if err != nil {
+		return err
+	}
+	if nowrap {
+		wordWrap = false
+	}
+
+	// prompts are provided via stdin or args so don't enter interactive mode
+	if len(prompts) > 0 {
+		return generate(cmd, args[0], strings.Join(prompts, " "), wordWrap, format)
+	}
+
+	return generateInteractive(cmd, args[0], wordWrap, format)
 }
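
The prompt assembly in the rewritten `RunGenerate` can be summarized in a few lines: piped stdin, when present, is placed before the command-line arguments, and an empty result means interactive mode. This is a minimal sketch with the terminal checks replaced by a boolean; `buildPrompt` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// buildPrompt is a hypothetical helper. piped stands in for the
// !term.IsTerminal(stdin) check, and stdin for the bytes read from it.
func buildPrompt(piped bool, stdin string, args []string) string {
	prompts := args
	if piped {
		// prepend stdin to the prompt if provided
		prompts = append([]string{stdin}, prompts...)
	}
	return strings.Join(prompts, " ")
}

func main() {
	// roughly: cat README.md | ollama run llama2 "Summarize this file:"
	fmt.Printf("%q\n", buildPrompt(true, "file contents\n", []string{"Summarize this file:"}))

	// no stdin and no args: the caller would enter interactive mode
	fmt.Printf("%q\n", buildPrompt(false, "", nil))
}
```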

 type generateContextKey string

-func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
+func generate(cmd *cobra.Command, model, prompt string, wordWrap bool, format string) error {
 	client, err := api.ClientFromEnvironment()
 	if err != nil {
 		return err
@@ -415,7 +406,7 @@ func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
 		generateContext = []int{}
 	}

-	termWidth, _, err := term.GetSize(int(0))
+	termWidth, _, err := term.GetSize(int(os.Stdout.Fd()))
 	if err != nil {
 		wordWrap = false
 	}
@@ -436,7 +427,7 @@ func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
 	var currentLineLength int
 	var wordBuffer string

-	request := api.GenerateRequest{Model: model, Prompt: prompt, Context: generateContext}
+	request := api.GenerateRequest{Model: model, Prompt: prompt, Context: generateContext, Format: format}
 	fn := func(response api.GenerateResponse) error {
 		if !spinner.IsFinished() {
 			spinner.Finish()
@@ -507,39 +498,12 @@ func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
 	return nil
 }

-func generateInteractive(cmd *cobra.Command, model string) error {
-	home, err := os.UserHomeDir()
-	if err != nil {
-		return err
-	}
-
+func generateInteractive(cmd *cobra.Command, model string, wordWrap bool, format string) error {
 	// load the model
-	if err := generate(cmd, model, "", false); err != nil {
+	if err := generate(cmd, model, "", false, ""); err != nil {
 		return err
 	}

-	completer := readline.NewPrefixCompleter(
-		readline.PcItem("/help"),
-		readline.PcItem("/list"),
-		readline.PcItem("/set",
-			readline.PcItem("history"),
-			readline.PcItem("nohistory"),
-			readline.PcItem("wordwrap"),
-			readline.PcItem("nowordwrap"),
-			readline.PcItem("verbose"),
-			readline.PcItem("quiet"),
-		),
-		readline.PcItem("/show",
-			readline.PcItem("license"),
-			readline.PcItem("modelfile"),
-			readline.PcItem("parameters"),
-			readline.PcItem("system"),
-			readline.PcItem("template"),
-		),
-		readline.PcItem("/exit"),
-		readline.PcItem("/bye"),
-	)
-
 	usage := func() {
 		fmt.Fprintln(os.Stderr, "Available Commands:")
 		fmt.Fprintln(os.Stderr, "  /set         Set session variables")
@@ -557,6 +521,8 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 		fmt.Fprintln(os.Stderr, "  /set nohistory    Disable history")
 		fmt.Fprintln(os.Stderr, "  /set wordwrap     Enable wordwrap")
 		fmt.Fprintln(os.Stderr, "  /set nowordwrap   Disable wordwrap")
+		fmt.Fprintln(os.Stderr, "  /set format json  Enable JSON mode")
+		fmt.Fprintln(os.Stderr, "  /set noformat     Disable formatting")
 		fmt.Fprintln(os.Stderr, "  /set verbose      Show LLM stats")
 		fmt.Fprintln(os.Stderr, "  /set quiet        Disable LLM stats")
 		fmt.Fprintln(os.Stderr, "")
@@ -572,47 +538,32 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 		fmt.Fprintln(os.Stderr, "")
 	}

-	var painter Painter
-
-	config := readline.Config{
-		Painter:      &painter,
-		Prompt:       ">>> ",
-		HistoryFile:  filepath.Join(home, ".ollama", "history"),
-		AutoComplete: completer,
+	prompt := readline.Prompt{
+		Prompt:         ">>> ",
+		AltPrompt:      "... ",
+		Placeholder:    "Send a message (/? for help)",
+		AltPlaceholder: `Use """ to end multi-line input`,
 	}

-	scanner, err := readline.NewEx(&config)
+	scanner, err := readline.New(prompt)
 	if err != nil {
 		return err
 	}
-	defer scanner.Close()

-	var wordWrap bool
-	termType := os.Getenv("TERM")
-	if termType == "xterm-256color" {
-		wordWrap = true
-	}
-
-	// override wrapping if the user turned it off
-	nowrap, err := cmd.Flags().GetBool("nowordwrap")
-	if err != nil {
-		return err
-	}
-	if nowrap {
-		wordWrap = false
-	}
+	fmt.Print(readline.StartBracketedPaste)
+	defer fmt.Print(readline.EndBracketedPaste)

 	var multiLineBuffer string
-	var isMultiLine bool

 	for {
 		line, err := scanner.Readline()
 		switch {
 		case errors.Is(err, io.EOF):
+			fmt.Println()
 			return nil
 		case errors.Is(err, readline.ErrInterrupt):
 			if line == "" {
-				fmt.Println("Use Ctrl-D or /bye to exit.")
+				fmt.Println("\nUse Ctrl-D or /bye to exit.")
 			}

 			continue
@@ -623,23 +574,19 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 		line = strings.TrimSpace(line)

 		switch {
-		case isMultiLine:
+		case scanner.Prompt.UseAlt:
 			if strings.HasSuffix(line, `"""`) {
-				isMultiLine = false
-				painter.IsMultiLine = isMultiLine
+				scanner.Prompt.UseAlt = false
 				multiLineBuffer += strings.TrimSuffix(line, `"""`)
 				line = multiLineBuffer
 				multiLineBuffer = ""
-				scanner.SetPrompt(">>> ")
 			} else {
 				multiLineBuffer += line + " "
 				continue
 			}
 		case strings.HasPrefix(line, `"""`):
-			isMultiLine = true
-			painter.IsMultiLine = isMultiLine
+			scanner.Prompt.UseAlt = true
 			multiLineBuffer = strings.TrimPrefix(line, `"""`) + " "
-			scanner.SetPrompt("... ")
 			continue
 		case strings.HasPrefix(line, "/list"):
 			args := strings.Fields(line)
@@ -666,19 +613,16 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 				case "quiet":
 					cmd.Flags().Set("verbose", "false")
 					fmt.Println("Set 'quiet' mode.")
-				case "mode":
-					if len(args) > 2 {
-						switch args[2] {
-						case "vim":
-							scanner.SetVimMode(true)
-						case "emacs", "default":
-							scanner.SetVimMode(false)
-						default:
-							usage()
-						}
+				case "format":
+					if len(args) < 3 || args[2] != "json" {
+						fmt.Println("Invalid or missing format. For 'json' mode use '/set format json'")
 					} else {
-						usage()
+						format = args[2]
+						fmt.Printf("Set format to '%s' mode.\n", args[2])
 					}
+				case "noformat":
+					format = ""
+					fmt.Println("Disabled format.")
 				default:
 					fmt.Printf("Unknown command '/set %s'. Type /? for help\n", args[1])
 				}
@@ -752,26 +696,13 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 		}

 		if len(line) > 0 && line[0] != '/' {
-			if err := generate(cmd, model, line, wordWrap); err != nil {
+			if err := generate(cmd, model, line, wordWrap, format); err != nil {
 				return err
 			}
 		}
 	}
 }

-func generateBatch(cmd *cobra.Command, model string) error {
-	scanner := bufio.NewScanner(os.Stdin)
-	for scanner.Scan() {
-		prompt := scanner.Text()
-		fmt.Printf(">>> %s\n", prompt)
-		if err := generate(cmd, model, prompt, false); err != nil {
-			return err
-		}
-	}
-
-	return nil
-}
-
 func RunServer(cmd *cobra.Command, _ []string) error {
 	host, port, err := net.SplitHostPort(os.Getenv("OLLAMA_HOST"))
 	if err != nil {
@@ -795,21 +726,6 @@ func RunServer(cmd *cobra.Command, _ []string) error {
 		origins = strings.Split(o, ",")
 	}

-	if noprune := os.Getenv("OLLAMA_NOPRUNE"); noprune == "" {
-		if err := server.PruneLayers(); err != nil {
-			return err
-		}
-
-		manifestsPath, err := server.GetManifestPath()
-		if err != nil {
-			return err
-		}
-
-		if err := server.PruneDirectory(manifestsPath); err != nil {
-			return err
-		}
-	}
-
 	return server.Serve(ln, origins)
 }

@@ -964,6 +880,7 @@ func NewCLI() *cobra.Command {
 	runCmd.Flags().Bool("verbose", false, "Show timings for response")
 	runCmd.Flags().Bool("insecure", false, "Use an insecure registry")
 	runCmd.Flags().Bool("nowordwrap", false, "Don't wrap words to the next line automatically")
+	runCmd.Flags().String("format", "", "Response format (e.g. json)")

 	serveCmd := &cobra.Command{
 		Use:     "serve",
--- a/docs/api.md
+++ b/docs/api.md
@@ -41,28 +41,36 @@ Generate a response for a given prompt with a provided model. This is a streamin

 Advanced parameters (optional):

+- `format`: the format to return a response in. Currently the only accepted value is `json`
 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
 - `system`: system prompt (overrides what is defined in the `Modelfile`)
 - `template`: the full prompt or prompt template (overrides what is defined in the `Modelfile`)
 - `context`: the context parameter returned from a previous request to `/generate`, this can be used to keep a short conversational memory
- `stream`: if `false` the response will be be returned as a single response object, rather than a stream of objects
+- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
+- `raw`: if `true` no formatting will be applied to the prompt and no context will be returned. You may choose to use the `raw` parameter if you are specifying a full templated prompt in your request to the API, and are managing history yourself.

-### Request
+### JSON mode
+
+Enable JSON mode by setting the `format` parameter to `json` and instructing the model to respond in JSON in the `prompt`. This will structure the response as valid JSON. See the JSON mode [example](#request-json-mode) below.
+
+### Examples
+
+#### Request

 ```shell
 curl -X POST http://localhost:11434/api/generate -d '{
-  "model": "llama2:7b",
+  "model": "llama2",
  "prompt": "Why is the sky blue?"
 }'
 ```

-### Response
+#### Response

-A stream of JSON objects:
+A stream of JSON objects is returned:

 ```json
 {
-  "model": "llama2:7b",
+  "model": "llama2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "response": "The",
  "done": false
@@ -86,7 +94,7 @@ To calculate how fast the response is generated in tokens per second (token/s),

 ```json
 {
-  "model": "llama2:7b",
+  "model": "llama2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "response": "",
  "context": [1, 2, 3],
@@ -102,6 +110,182 @@ To calculate how fast the response is generated in tokens per second (token/s),
 }
 ```

+#### Request (No streaming)
+
+```shell
+curl -X POST http://localhost:11434/api/generate -d '{
+  "model": "llama2:7b",
+  "prompt": "Why is the sky blue?",
+  "stream": false
+}'
+```
+
+#### Response
+
+If `stream` is set to `false`, the response will be a single JSON object:
+
+```json
+{
+  "model": "llama2:7b",
+  "created_at": "2023-08-04T19:22:45.499127Z",
+  "response": "The sky is blue because it is the color of the sky.",
+  "context": [1, 2, 3],
+  "done": true,
+  "total_duration": 5589157167,
+  "load_duration": 3013701500,
+  "sample_count": 114,
+  "sample_duration": 81442000,
+  "prompt_eval_count": 46,
+  "prompt_eval_duration": 1160282000,
+  "eval_count": 13,
+  "eval_duration": 1325948000
+}
+```
+
+#### Request (Raw mode)
+
+In some cases you may wish to bypass the templating system and provide a full prompt. In this case, you can use the `raw` parameter to disable formatting and context.
+
+```shell
+curl -X POST http://localhost:11434/api/generate -d '{
+  "model": "mistral",
+  "prompt": "[INST] why is the sky blue? [/INST]",
+  "raw": true,
+  "stream": false
+}'
+```
+
+#### Response
+
+```json
+{
+  "model": "mistral",
+  "created_at": "2023-11-03T15:36:02.583064Z",
+  "response": " The sky appears blue because of a phenomenon called Rayleigh scattering.",
+  "done": true,
+  "total_duration": 14648695333,
+  "load_duration": 3302671417,
+  "prompt_eval_count": 14,
+  "prompt_eval_duration": 286243000,
+  "eval_count": 129,
+  "eval_duration": 10931424000
+}
+```
+
+#### Request (JSON mode)
+
+```shell
+curl -X POST http://localhost:11434/api/generate -d '{
+  "model": "llama2",
+  "prompt": "What color is the sky at different times of the day? Respond using JSON",
+  "format": "json",
+  "stream": false
+}'
+```
+
+#### Response
+
+```json
+{
+  "model": "llama2",
+  "created_at": "2023-11-09T21:07:55.186497Z",
+  "response": "{\n\"morning\": {\n\"color\": \"blue\"\n},\n\"noon\": {\n\"color\": \"blue-gray\"\n},\n\"afternoon\": {\n\"color\": \"warm gray\"\n},\n\"evening\": {\n\"color\": \"orange\"\n}\n}\n",
+  "done": true,
+  "total_duration": 4661289125,
+  "load_duration": 1714434500,
+  "prompt_eval_count": 36,
+  "prompt_eval_duration": 264132000,
+  "eval_count": 75,
+  "eval_duration": 2112149000
+}
+```
+
+The value of `response` will be a string containing JSON similar to:
+
+```json
+{
+  "morning": {
+    "color": "blue"
+  },
+  "noon": {
+    "color": "blue-gray"
+  },
+  "afternoon": {
+    "color": "warm gray"
+  },
+  "evening": {
+    "color": "orange"
+  }
+}
+```
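
Since `response` arrives as a string that itself contains JSON, a client has to decode twice: once for the API envelope and once for the payload. The sketch below shows this in Go, assuming the reply shape from the example above. `generateRequest`, `generateResponse`, and `decodeJSONMode` are hypothetical names, the structs keep only the fields used here, and `Stream` is simplified to a plain `bool`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest keeps only the fields used in the JSON mode example.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Format string `json:"format,omitempty"`
	Stream bool   `json:"stream"`
}

// generateResponse keeps only the fields this sketch reads.
type generateResponse struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

// decodeJSONMode decodes the API reply, then decodes the JSON document
// carried inside its response string.
func decodeJSONMode(body []byte) (map[string]map[string]string, error) {
	var resp generateResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}

	var out map[string]map[string]string
	if err := json.Unmarshal([]byte(resp.Response), &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	req, err := json.Marshal(generateRequest{
		Model:  "llama2",
		Prompt: "What color is the sky at different times of the day? Respond using JSON",
		Format: "json",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(req))

	// a shortened copy of the reply shown above
	body := []byte(`{"model":"llama2","response":"{\n\"morning\": {\n\"color\": \"blue\"\n}\n}\n","done":true}`)
	colors, err := decodeJSONMode(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(colors["morning"]["color"])
}
```

The nested map type only fits this particular reply; for arbitrary model output, decoding into `map[string]any` or `json.RawMessage` is a more forgiving choice.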
+
+#### Request (With options)
+
+If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the `options` parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override.
+
+```shell
+curl -X POST http://localhost:11434/api/generate -d '{
+  "model": "llama2:7b",
+  "prompt": "Why is the sky blue?",
+  "stream": false,
+  "options": {
+    "num_keep": 5,
+    "seed": 42,
+    "num_predict": 100,
+    "top_k": 20,
+    "top_p": 0.9,
+    "tfs_z": 0.5,
+    "typical_p": 0.7,
+    "repeat_last_n": 33,
+    "temperature": 0.8,
+    "repeat_penalty": 1.2,
+    "presence_penalty": 1.5,
+    "frequency_penalty": 1.0,
+    "mirostat": 1,
+    "mirostat_tau": 0.8,
+    "mirostat_eta": 0.6,
+    "penalize_newline": true,
+    "stop": ["\n", "user:"],
+    "numa": false,
+    "num_ctx": 4,
+    "num_batch": 2,
+    "num_gqa": 1,
+    "num_gpu": 1,
+    "main_gpu": 0,
+    "low_vram": false,
+    "f16_kv": true,
+    "logits_all": false,
+    "vocab_only": false,
+    "use_mmap": true,
+    "use_mlock": false,
+    "embedding_only": false,
+    "rope_frequency_base": 1.1,
+    "rope_frequency_scale": 0.8,
+    "num_thread": 8
+    }
+}'
+```
+
+#### Response
+
+```json
+{
+  "model": "llama2:7b",
+  "created_at": "2023-08-04T19:22:45.499127Z",
+  "response": "The sky is blue because it is the color of the sky.",
+  "context": [1, 2, 3],
+  "done": true,
+  "total_duration": 5589157167,
+  "load_duration": 3013701500,
+  "sample_count": 114,
+  "sample_duration": 81442000,
+  "prompt_eval_count": 46,
+  "prompt_eval_duration": 1160282000,
+  "eval_count": 13,
+  "eval_duration": 1325948000
+}
+```
+
 ## Create a Model

 ```shell
@@ -114,9 +298,11 @@ Create a model from a [`Modelfile`](./modelfile.md)

 - `name`: name of the model to create
 - `path`: path to the Modelfile
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects
+- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects

-### Request
+### Examples
+
+#### Request

 ```shell
 curl -X POST http://localhost:11434/api/create -d '{
@@ -125,7 +311,7 @@ curl -X POST http://localhost:11434/api/create -d '{
 }'
 ```

-### Response
+#### Response

 A stream of JSON objects. When finished, `status` is `success`.

@@ -143,13 +329,17 @@ GET /api/tags

 List models that are available locally.

-### Request
+### Examples
+
+#### Request

 ```shell
 curl http://localhost:11434/api/tags
 ```

-### Response
+#### Response
+
+A single JSON object will be returned.

 ```json
 {
@@ -180,7 +370,9 @@ Show details about a model including modelfile, template, parameters, license, a

 - `name`: name of the model to show

-### Request
+### Examples
+
+#### Request

 ```shell
 curl http://localhost:11434/api/show -d '{
@@ -188,14 +380,14 @@ curl http://localhost:11434/api/show -d '{
 }'
 ```

-### Response
+#### Response

 ```json
 {
  "license": "<contents of license block>",
-  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llama2:latest\n\nFROM /Users/username/.ollama/models/blobs/sha256:8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8\nTEMPLATE \"\"\"[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] \"\"\"\nSYSTEM \"\"\"\"\"\"\nPARAMETER stop [INST]\nPARAMETER stop [/INST]\nPARAMETER stop <<SYS>>\nPARAMETER stop <</SYS>>\n",
+  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llama2:latest\n\nFROM /Users/username/.ollama/models/blobs/sha256:8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8\nTEMPLATE \"\"\"[INST] <<SYS>>{{ .System }}<</SYS>>\n\n{{ .Prompt }} [/INST] \"\"\"\nSYSTEM \"\"\"\"\"\"\nPARAMETER stop [INST]\nPARAMETER stop [/INST]\nPARAMETER stop <<SYS>>\nPARAMETER stop <</SYS>>\n",
  "parameters": "stop                           [INST]\nstop                           [/INST]\nstop                           <<SYS>>\nstop                           <</SYS>>",
-  "template": "[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] "
+  "template": "[INST] <<SYS>>{{ .System }}<</SYS>>\n\n{{ .Prompt }} [/INST] "
 }
 ```

@@ -207,7 +399,9 @@ POST /api/copy

 Copy a model. Creates a model with another name from an existing model.

-### Request
+### Examples
+
+#### Request

 ```shell
 curl http://localhost:11434/api/copy -d '{
@@ -216,6 +410,10 @@ curl http://localhost:11434/api/copy -d '{
 }'
 ```

+#### Response
+
+The only response is a 200 OK if successful.
+
 ## Delete a Model

 ```shell
@@ -226,9 +424,11 @@ Delete a model and its data.

 ### Parameters

- `model`: model name to delete
+- `name`: model name to delete

-### Request
+### Examples
+
+#### Request

 ```shell
 curl -X DELETE http://localhost:11434/api/delete -d '{
@@ -236,6 +436,10 @@ curl -X DELETE http://localhost:11434/api/delete -d '{
 }'
 ```

+#### Response
+
+If successful, the only response is a 200 OK.
+
 ## Pull a Model

 ```shell
@@ -248,9 +452,11 @@ Download a model from the ollama library. Cancelled pulls are resumed from where

 - `name`: name of the model to pull
 - `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects
+- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects

-### Request
+### Examples
+
+#### Request

 ```shell
 curl -X POST http://localhost:11434/api/pull -d '{
@@ -258,13 +464,51 @@ curl -X POST http://localhost:11434/api/pull -d '{
 }'
 ```

-### Response
+#### Response
+
+If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:
+
+The first object is the manifest:
+
+```json
+{
+  "status": "pulling manifest"
+}
+```
+
+Then there is a series of downloading responses. Until any of the downloads has completed, the `completed` key may not be included. The number of files to be downloaded depends on the number of layers specified in the manifest.

 ```json
 {
  "status": "downloading digestname",
  "digest": "digestname",
-  "total": 2142590208
+  "total": 2142590208,
+  "completed": 241970
+}
+```
+
+After all the files are downloaded, the final responses are:
+
+```json
+{
+    "status": "verifying sha256 digest"
+}
+{
+    "status": "writing manifest"
+}
+{
+    "status": "removing any unused layers"
+}
+{
+    "status": "success"
+}
+```
+
+If `stream` is set to `false`, then the response is a single JSON object:
+
+```json
+{
+  "status": "success"
 }
 ```
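Reviewer sketch for the pull responses documented in the hunk above: one way a client might turn each streamed status object into a progress string. It assumes every streamed line is a single JSON object carrying the `status`, `total`, and `completed` keys shown in the examples. The helper name `pull_progress` is hypothetical and not part of Ollama.

```python
import json


def pull_progress(line: str) -> str:
    """Format one streamed /api/pull status line as a short progress string."""
    obj = json.loads(line)
    status = obj.get("status", "")
    total = obj.get("total")
    if not total:
        # Statuses such as "pulling manifest" carry no byte counts.
        return status
    # `completed` may be missing early in a download; treat that as 0 bytes.
    completed = obj.get("completed", 0)
    return f"{status}: {100 * completed / total:.1f}%"
```

Defaulting a missing `completed` key to 0 keeps the caller from special-casing the first few download objects.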

@@ -280,9 +524,11 @@ Upload a model to a model library. Requires registering for ollama.ai and adding

 - `name`: name of the model to push in the form of `<namespace>/<model>:<tag>`
 - `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects
+- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects

-### Request
+### Examples
+
+#### Request

 ```shell
 curl -X POST http://localhost:11434/api/push -d '{
@@ -290,9 +536,9 @@ curl -X POST http://localhost:11434/api/push -d '{
 }'
 ```

-### Response
+#### Response

-Streaming response that starts with:
+If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:

 ```json
 { "status": "retrieving manifest" }
@@ -325,6 +571,12 @@ Finally, when the upload is complete:
 {"status":"success"}
 ```

+If `stream` is set to `false`, then the response is a single JSON object:
+
+```json
+{ "status": "success" }
+```
+
 ## Generate Embeddings

 ```shell
@@ -342,7 +594,9 @@ Advanced parameters:

 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`

-### Request
+### Examples
+
+#### Request

 ```shell
 curl -X POST http://localhost:11434/api/embeddings -d '{
@@ -351,7 +605,7 @@ curl -X POST http://localhost:11434/api/embeddings -d '{
 }'
 ```

-### Response
+#### Response

 ```json
 {
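Reviewer sketch for the JSON mode example in `docs/api.md` above: the `response` field is itself a JSON-encoded string, so a client has to decode twice. This assumes a non-streaming reply (`"stream": false`) shaped like the documented one. The function name `decode_json_mode` is hypothetical.

```python
import json


def decode_json_mode(body: str) -> dict:
    """Return the JSON document embedded in a non-streaming /api/generate reply."""
    outer = json.loads(body)  # the API response object
    # The model's output is stored as a string and needs its own decode step.
    return json.loads(outer["response"])
```

The second `json.loads` call raises `json.JSONDecodeError` if the model's output is not valid JSON, so a real client would probably want to catch that.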
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -16,19 +16,83 @@ journalctl -u ollama

 If you're running `ollama serve` directly, the logs will be printed to the console.

-## How can I expose the Ollama server?
+## How can I expose Ollama on my network?
+
+Ollama binds to 127.0.0.1 port 11434 by default. Change the bind address with the `OLLAMA_HOST` environment variable.
+
+On macOS:

 ```bash
 OLLAMA_HOST=0.0.0.0:11435 ollama serve
 ```

-By default, Ollama allows cross origin requests from `127.0.0.1` and `0.0.0.0`. To support more origins, you can use the `OLLAMA_ORIGINS` environment variable:
+On Linux:
+
+Create a `systemd` drop-in directory and set `Environment=OLLAMA_HOST`:
+
+```bash
+mkdir -p /etc/systemd/system/ollama.service.d
+echo "[Service]" >>/etc/systemd/system/ollama.service.d/environment.conf
+```
+
+```bash
+echo "Environment=OLLAMA_HOST=0.0.0.0:11434" >>/etc/systemd/system/ollama.service.d/environment.conf
+```
+
+Reload `systemd` and restart Ollama:
+
+```bash
+systemctl daemon-reload
+systemctl restart ollama
+```
+
+## How can I allow additional web origins to access Ollama?
+
+Ollama allows cross-origin requests from `127.0.0.1` and `0.0.0.0` by default. Add additional origins with the `OLLAMA_ORIGINS` environment variable:
+
+On macOS:

 ```bash
 OLLAMA_ORIGINS=http://192.168.1.1:*,https://example.com ollama serve
 ```

+On Linux:
+
+```bash
+echo "Environment=OLLAMA_ORIGINS=http://192.168.1.1:*,https://example.com" >>/etc/systemd/system/ollama.service.d/environment.conf
+```
+
+Reload `systemd` and restart Ollama:
+
+```bash
+systemctl daemon-reload
+systemctl restart ollama
+```
+
 ## Where are models stored?

 - macOS: Raw model data is stored under `~/.ollama/models`.
 - Linux: Raw model data is stored under `/usr/share/ollama/.ollama/models`
+
+
+
+Below the models directory you will find a structure similar to the following:
+
+```shell
+.
+├── blobs
+└── manifests
+   └── registry.ollama.ai
+      ├── f0rodo
+      ├── library
+      ├── mattw
+      └── saikatkumardey
+```
+
+There is a `manifests/registry.ollama.ai/namespace` path. In the example above, the user has downloaded models from the official `library`, `f0rodo`, `mattw`, and `saikatkumardey` namespaces. Within each of those directories, you will find a directory for each downloaded model, and inside it a file named for each tag. Each tag file is the manifest for the model.
+
+The manifest lists all the layers used in this model. You will see a `media type` for each layer, along with a digest. That digest corresponds with a file in the `models/blobs` directory.
+
+### How can I change where Ollama stores models?
+
+To modify where models are stored, you can use the `OLLAMA_MODELS` environment variable. Note that on Linux this means defining `OLLAMA_MODELS` in a drop-in `/etc/systemd/system/ollama.service.d` service file, reloading systemd, and restarting the ollama service.
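Reviewer sketch for the storage layout described in `docs/faq.md` above: given the text of a tag manifest, list where each layer's blob should live. The key names `layers`, `mediaType`, and `digest` are assumptions about the manifest JSON, as is the idea that a blob file is named exactly after its digest. The helper `layer_blobs` is hypothetical.

```python
import json
from pathlib import Path


def layer_blobs(manifest_text: str, models_dir: str) -> list[tuple[str, Path]]:
    """Pair each layer's media type with the blob path its digest points to."""
    manifest = json.loads(manifest_text)
    blobs = Path(models_dir) / "blobs"
    return [
        # Assumed layout: <models_dir>/blobs/<digest>
        (layer.get("mediaType", ""), blobs / layer["digest"])
        for layer in manifest.get("layers", [])
    ]
```

Passing the models directory as an argument keeps the sketch independent of the platform defaults and of `OLLAMA_MODELS`.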
--- a/docs/import.md
+++ b/docs/import.md
@@ -1,8 +1,43 @@
 # Import a model

-This guide walks through importing a PyTorch, Safetensors or GGUF model.
+This guide walks through importing a GGUF, PyTorch or Safetensors model.

-## Supported models
+## Importing (GGUF)
+
+### Step 1: Write a `Modelfile`
+
+Start by creating a `Modelfile`. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
+
+```
+FROM ./mistral-7b-v0.1.Q4_0.gguf
+```
+
+(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
+
+```
+FROM ./mistral-7b-v0.1.Q4_0.gguf
+TEMPLATE "[INST] {{ .Prompt }} [/INST]"
+```
+
+### Step 2: Create the Ollama model
+
+Create a model from your `Modelfile`:
+
+```
+ollama create example -f Modelfile
+```
+
+### Step 3: Run your model
+
+Test the model with `ollama run`:
+
+```
+ollama run example "What is your favourite condiment?"
+```
+
+## Importing (PyTorch & Safetensors)
+
+### Supported models

 Ollama supports a set of model architectures, with support for more coming soon:

@@ -13,8 +48,6 @@ Ollama supports a set of model architectures, with support for more coming soon:

 To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).

-## Importing
-
 ### Step 1: Clone the HuggingFace repository (optional)

 If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
@@ -44,7 +77,7 @@ This will output two files into the directory:

 ### Step 3: Write a `Modelfile`

-Next, create a `Modelfile` for your model. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
+Next, create a `Modelfile` for your model:

 ```
 FROM ./q4_0.bin
@@ -65,13 +98,15 @@ Finally, create a model from your `Modelfile`:
 ollama create example -f Modelfile
 ```

+### Step 5: Run your model
+
 Next, test the model with `ollama run`:

 ```
 ollama run example "What is your favourite condiment?"
 ```

-### Step 5: Publish your model (optional – early alpha)
+## Publishing your model (optional – early alpha)

 Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:

@@ -150,7 +185,7 @@ python convert.py <path to model directory>
 python convert-falcon-hf-to-gguf.py <path to model directory>

 # GPTNeoXForCausalLM
-python convert-falcon-hf-to-gguf.py <path to model directory>
+python convert-gptneox-hf-to-gguf.py <path to model directory>

 # GPTBigCodeForCausalLM
 python convert-starcoder-hf-to-gguf.py <path to model directory>
--- a/docs/linux.md
+++ b/docs/linux.md
@@ -1,12 +1,16 @@
-# Installing Ollama on Linux
+# Ollama on Linux

-> Note: A one line installer for Ollama is available by running:
+## Install
+
+Install Ollama by running this one-liner:
 >
-> ```bash
-> curl https://ollama.ai/install.sh | sh
-> ```
+```bash
+curl https://ollama.ai/install.sh | sh
+```

-## Download the `ollama` binary
+## Manual install
+
+### Download the `ollama` binary

 Ollama is distributed as a self-contained binary. Download it to a directory in your PATH:

@@ -15,31 +19,7 @@ sudo curl -L https://ollama.ai/download/ollama-linux-amd64 -o /usr/bin/ollama
 sudo chmod +x /usr/bin/ollama
 ```

-## Start Ollama
-
-Start Ollama by running `ollama serve`:
-
-```bash
-ollama serve
-```
-
-Once Ollama is running, run a model in another terminal session:
-
-```bash
-ollama run llama2
-```
-
-## Install CUDA drivers (optional – for Nvidia GPUs)
-
-[Download and install](https://developer.nvidia.com/cuda-downloads) CUDA.
-
-Verify that the drivers are installed by running the following command, which should print details about your GPU:
-
-```bash
-nvidia-smi
-```
-
-## Adding Ollama as a startup service (optional)
+### Adding Ollama as a startup service (recommended)

 Create a user for Ollama:

@@ -60,7 +40,6 @@ User=ollama
 Group=ollama
 Restart=always
 RestartSec=3
-Environment="HOME=/usr/share/ollama"

 [Install]
 WantedBy=default.target
@@ -73,10 +52,65 @@ sudo systemctl daemon-reload
 sudo systemctl enable ollama
 ```

-### Viewing logs
+### Install CUDA drivers (optional – for Nvidia GPUs)
+
+[Download and install](https://developer.nvidia.com/cuda-downloads) CUDA.
+
+Verify that the drivers are installed by running the following command, which should print details about your GPU:
+
+```bash
+nvidia-smi
+```
+
+### Start Ollama
+
+Start Ollama using `systemd`:
+
+```bash
+sudo systemctl start ollama
+```
+
+## Update
+
+Update ollama by running the install script again:
+
+```bash
+curl https://ollama.ai/install.sh | sh
+```
+
+Or by downloading the ollama binary:
+
+```bash
+sudo curl -L https://ollama.ai/download/ollama-linux-amd64 -o /usr/bin/ollama
+sudo chmod +x /usr/bin/ollama
+```
+
+## Viewing logs

 To view logs of Ollama running as a startup service, run:

 ```bash
 journalctl -u ollama
 ```
+
+## Uninstall
+
+Remove the ollama service:
+
+```bash
+sudo systemctl stop ollama
+sudo systemctl disable ollama
+sudo rm /etc/systemd/system/ollama.service
+```
+
+Remove the ollama binary from your bin directory (either `/usr/local/bin`, `/usr/bin`, or `/bin`):
+
+```bash
+sudo rm $(which ollama)
+```
+
+Remove the downloaded models and Ollama service user:
+```bash
+sudo rm -r /usr/share/ollama
+sudo userdel ollama
+```
--- a/docs/modelfile.md
+++ b/docs/modelfile.md
@@ -112,8 +112,8 @@ PARAMETER <parameter> <parametervalue>
 | repeat_last_n  | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)                                                                                                                                           | int        | repeat_last_n 64     |
 | repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)                                                                     | float      | repeat_penalty 1.1   |
 | temperature    | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)                                                                                                                                     | float      | temperature 0.7      |
-| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
-| stop           | Sets the stop sequences to use.                                                                                                                                                                                                                         | string     | stop "AI assistant:" |
+| seed           | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0)                                                                                       | int        | seed 42              |
+| stop           | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile.                                      | string     | stop "AI assistant:" |
 | tfs_z          | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1)                                               | float      | tfs_z 1              |
 | num_predict    | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)                                                                                                                                   | int        | num_predict 42       |
 | top_k          | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)                                                                        | int        | top_k 40             |
@@ -129,14 +129,11 @@ PARAMETER <parameter> <parametervalue>
 | --------------- | ------------------------------------------------------------------------------------------------------------ |
 | `{{ .System }}` | The system prompt used to specify custom behavior, this must also be set in the Modelfile as an instruction. |
 | `{{ .Prompt }}` | The incoming prompt, this is not specified in the model file and will be set based on input.                 |
-| `{{ .First }}`  | A boolean value used to render specific template information for the first generation of a session.          |

 ```modelfile
 TEMPLATE """
-{{- if .First }}
 ### System:
 {{ .System }}
-{{- end }}

 ### User:
 {{ .Prompt }}
--- a/docs/tutorials.md
+++ b/docs/tutorials.md
@@ -4,5 +4,6 @@ Here is a list of ways you can use Ollama with other tools to build interesting

 - [Using LangChain with Ollama in JavaScript](./tutorials/langchainjs.md)
 - [Using LangChain with Ollama in Python](./tutorials/langchainpy.md)
+- [Running Ollama on NVIDIA Jetson Devices](./tutorials/nvidia-jetson.md)

-Also be sure to check out the [examples](../examples) directory for more ways to use Ollama.
+Also be sure to check out the [examples](../examples) directory for more ways to use Ollama.
--- a/docs/tutorials/langchainjs.md
+++ b/docs/tutorials/langchainjs.md
@@ -23,13 +23,17 @@ const answer = await ollama.call(`why is the sky blue?`);
 console.log(answer);
 ```

-That will get us the same thing as if we ran `ollama run llama2 "why is the sky blue"` in the terminal. But we want to load a document from the web to ask a question against. **Cheerio** is a great library for ingesting a webpage, and **LangChain** uses it in their **CheerioWebBaseLoader**. So let's build that part of the app.
+That will get us the same thing as if we ran `ollama run llama2 "why is the sky blue"` in the terminal. But we want to load a document from the web to ask a question against. **Cheerio** is a great library for ingesting a webpage, and **LangChain** uses it in their **CheerioWebBaseLoader**. So let's install **Cheerio** and build that part of the app.
+
+```bash
+npm install cheerio 
+```

 ```javascript
 import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";

 const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/2023_Hawaii_wildfires");
-const data = loader.load();
+const data = await loader.load();
 ```

 That will load the document. Although this page is smaller than the Odyssey, it is certainly bigger than the context size for most LLMs. So we are going to need to split into smaller pieces, and then select just the pieces relevant to our question. This is a great use for a vector datastore. In this example, we will use the **MemoryVectorStore** that is part of **LangChain**. But there is one more thing we need to get the content into the datastore. We have to run an embeddings process that converts the tokens in the text into a series of vectors. And for that, we are going to use **Tensorflow**. There is a lot of stuff going on in this one. First, install the **Tensorflow** components that we need.
--- a/docs/tutorials/nvidia-jetson.md
+++ b/docs/tutorials/nvidia-jetson.md
@@ -0,0 +1,38 @@
+# Running Ollama on NVIDIA Jetson Devices
+
+With some minor configuration, Ollama runs well on [NVIDIA Jetson Devices](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/). The following has been tested on [JetPack 5.1.2](https://developer.nvidia.com/embedded/jetpack).
+
+NVIDIA Jetson devices are Linux-based embedded AI computers that are purpose-built for AI applications.
+
+Jetsons have an integrated GPU that is wired directly to the memory controller of the machine. For this reason, the `nvidia-smi` command is unrecognized, and Ollama proceeds to operate in "CPU only"
+mode. This can be verified by using a monitoring tool like jtop.
+
+In order to address this, we simply pass the path to the Jetson's pre-installed CUDA libraries into `ollama serve` (while in a tmux session). We then hardcode the num_gpu parameters into a cloned
+version of our target model.
+
+Prerequisites:
+
+- curl
+- tmux
+
+Here are the steps:
+
+- Install Ollama via standard Linux command (ignore the 404 error): `curl https://ollama.ai/install.sh | sh`
+- Stop the Ollama service: `sudo systemctl stop ollama`
+- Start Ollama serve in a tmux session called ollama_jetson and reference the CUDA libraries path: `tmux has-session -t ollama_jetson 2>/dev/null || tmux new-session -d -s ollama_jetson 
+'LD_LIBRARY_PATH=/usr/local/cuda/lib64 ollama serve'`
+- Pull the model you want to use (e.g. mistral): `ollama pull mistral`
+- Create a new Modelfile specifically for enabling GPU support on the Jetson: `touch ModelfileMistralJetson`
+- In the ModelfileMistralJetson file, specify the FROM model and the num_gpu PARAMETER as shown below:
+
+```
+FROM mistral
+PARAMETER num_gpu 999
+```
+
+- Create a new model from your Modelfile: `ollama create mistral-jetson -f ./ModelfileMistralJetson`
+- Run the new model: `ollama run mistral-jetson`
+
+If you run a monitoring tool like jtop you should now see that Ollama is using the Jetson's integrated GPU.
+
+And that's it!
--- a/examples/bash-comparemodels/README.md
+++ b/examples/bash-comparemodels/README.md
@@ -0,0 +1,10 @@
+# Bash Shell examples
+
+When calling `ollama`, you can pass it a file to run all the prompts in the file, one after the other:
+
+`ollama run llama2 < sourcequestions.txt`
+
+This concept is used in the following example.
+
+## Compare Models
+`comparemodels.sh` is a script that runs all the questions in `sourcequestions.txt` using any 4 models you choose that you have already pulled from the Ollama library or have created locally.
--- a/examples/bash-comparemodels/comparemodels.sh
+++ b/examples/bash-comparemodels/comparemodels.sh
@@ -0,0 +1,64 @@
+#! /usr/bin/env bash
+# Compare multiple models by running them with the same questions
+
+NUMBEROFCHOICES=4
+SELECTIONS=()
+declare -a SUMS=()
+
+# Get the list of models, skipping the header row printed by `ollama list`
+CHOICES=$(ollama list | awk 'NR > 1 {print $1}')
+
+# Select which models to run as a comparison
+echo "Select $NUMBEROFCHOICES models to compare:"
+select ITEM in $CHOICES; do
+    if [[ -n $ITEM ]]; then
+        echo "You have selected $ITEM"
+        SELECTIONS+=("$ITEM")
+        ((COUNT++))
+        if [[ $COUNT -eq $NUMBEROFCHOICES ]]; then
+            break
+        fi
+    else
+        echo "Invalid selection"
+    fi
+done
+
+# Loop through each of the selected models
+for ITEM in "${SELECTIONS[@]}"; do
+    echo "--------------------------------------------------------------"
+    echo "Loading the model $ITEM into memory"
+    ollama run "$ITEM" ""
+    echo "--------------------------------------------------------------"
+    echo "Running the questions through the model $ITEM"
+    COMMAND_OUTPUT=$(ollama run "$ITEM" --verbose < sourcequestions.txt 2>&1| tee /dev/stderr)
+
+    # eval duration is sometimes listed in seconds and sometimes in milliseconds. 
+    # Add up the values for each model
+    SUM=$(echo "$COMMAND_OUTPUT" | awk '
+    /eval duration:/ {
+        value = $3
+        if (index(value, "ms") > 0) {
+            gsub("ms", "", value)
+            value /= 1000
+        } else {
+            gsub("s", "", value)
+        }
+        sum += value
+    }
+    END { print sum }')
+
+
+    SUMS+=("All questions for $ITEM completed in $SUM seconds")
+done
+
+echo ""
+echo "--------------------------------------------------------------"
+echo -e "Sums of eval durations for each run:"
+for val in "${SUMS[@]}"; do
+    echo "$val"
+done
+
+echo "--------------------------------------------------------------"
+echo "Comparison complete. Now you can decide"
+echo "which model is best."
+echo "--------------------------------------------------------------"
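Reviewer sketch of the duration-summing step in `comparemodels.sh` above, written in Python for readability. It assumes the `--verbose` output contains lines of the form `eval duration: <number>s` or `eval duration: <number>ms`. Like the script's use of the third `awk` field, it only counts lines that begin with `eval duration:`. The function name `sum_eval_durations` is hypothetical.

```python
import re

# Lines that start with "eval duration:" followed by a value in s or ms.
_EVAL_RE = re.compile(
    r"^[ \t]*eval duration:[ \t]*(\d+(?:\.\d+)?)(ms|s)\b", re.MULTILINE
)


def sum_eval_durations(output: str) -> float:
    """Sum every 'eval duration:' value in the output, in seconds."""
    total = 0.0
    for value, unit in _EVAL_RE.findall(output):
        # Normalize milliseconds to seconds before adding.
        total += float(value) / 1000 if unit == "ms" else float(value)
    return total
```

Values in other units, such as minutes, are ignored by this sketch.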
--- a/examples/bash-comparemodels/sourcequestions.txt
+++ b/examples/bash-comparemodels/sourcequestions.txt
@@ -0,0 +1,7 @@
+Why is the sky blue
+What is a black hole
+Explain the big bang theory like I am 5?
+What is the quickest way to win a game of Monopoly with 3 others?
+Why does a vacuum bottle keep my coffee hot and my milkshake cold?
+What is the difference between a meteor, a meteorite, and a meteoroid?
+Create an array with 5 items and print to the console. Do this in Python, C#, Typescript, and Rust.
--- a/examples/kubernetes/README.md
+++ b/examples/kubernetes/README.md
@@ -0,0 +1,36 @@
+# Deploy Ollama to Kubernetes
+
+## Prerequisites
+
+- Ollama: https://ollama.ai/download
+- Kubernetes cluster. This example will use Google Kubernetes Engine.
+
+## Steps
+
+1. Create the Ollama namespace, deployment, and service
+
+    ```bash
+    kubectl apply -f cpu.yaml
+    ```
+
+1. Port forward the Ollama service to connect and use it locally
+
+    ```bash
+    kubectl -n ollama port-forward service/ollama 11434:80
+    ```
+
+1. Pull and run a model, for example `orca-mini:3b`
+
+    ```bash
+    ollama run orca-mini:3b
+    ```
+
+## (Optional) Hardware Acceleration
+
+Hardware acceleration in Kubernetes requires NVIDIA's [`k8s-device-plugin`](https://github.com/NVIDIA/k8s-device-plugin). Follow the link for more details.
+
+Once configured, create a GPU-enabled Ollama deployment.
+
+```bash
+kubectl apply -f gpu.yaml
+```
--- a/examples/kubernetes/cpu.yaml
+++ b/examples/kubernetes/cpu.yaml
@@ -0,0 +1,42 @@
+---
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: ollama
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: ollama
+  namespace: ollama
+spec:
+  selector:
+    matchLabels:
+      name: ollama
+  template:
+    metadata:
+      labels:
+        name: ollama
+    spec:
+      containers:
+      - name: ollama
+        image: ollama/ollama:latest
+        ports:
+        - name: http
+          containerPort: 11434
+          protocol: TCP
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: ollama
+  namespace: ollama
+spec:
+  type: ClusterIP
+  selector:
+    name: ollama
+  ports:
+  - port: 80
+    name: http
+    targetPort: http
+    protocol: TCP
--- a/examples/kubernetes/gpu.yaml
+++ b/examples/kubernetes/gpu.yaml
@@ -0,0 +1,56 @@
+---
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: ollama
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: ollama
+  namespace: ollama
+spec:
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      name: ollama
+  template:
+    metadata:
+      labels:
+        name: ollama
+    spec:
+      containers:
+      - name: ollama
+        image: ollama/ollama:latest
+        env:
+        - name: PATH
+          value: /usr/local/nvidia/bin:/usr/local/nvidia/lib64:/usr/bin:/usr/sbin:/bin:/sbin
+        - name: LD_LIBRARY_PATH
+          value: /usr/local/nvidia/lib64
+        ports:
+        - name: http
+          containerPort: 11434
+          protocol: TCP
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+      tolerations:
+      - key: nvidia.com/gpu
+        operator: Exists
+        effect: NoSchedule
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: ollama
+  namespace: ollama
+spec:
+  type: ClusterIP
+  selector:
+    name: ollama
+  ports:
+  - port: 80
+    name: http
+    targetPort: http
+    protocol: TCP
--- a/examples/langchain-python-rag-privategpt/constants.py
+++ b/examples/langchain-python-rag-privategpt/constants.py
@@ -6,7 +6,6 @@ PERSIST_DIRECTORY = os.environ.get('PERSIST_DIRECTORY', 'db')

 # Define the Chroma settings
 CHROMA_SETTINGS = Settings(
-        chroma_db_impl='duckdb+parquet',
        persist_directory=PERSIST_DIRECTORY,
        anonymized_telemetry=False
 )
--- a/examples/langchain-python-rag-privategpt/ingest.py
+++ b/examples/langchain-python-rag-privategpt/ingest.py
@@ -150,7 +150,7 @@ def main():
        print("Creating new vectorstore")
        texts = process_documents()
        print(f"Creating embeddings. May take some minutes...")
-        db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
+        db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)
    db.persist()
    db = None

--- a/examples/langchain-python-rag-privategpt/privateGPT.py
+++ b/examples/langchain-python-rag-privategpt/privateGPT.py
@@ -4,6 +4,7 @@ from langchain.embeddings import HuggingFaceEmbeddings
 from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
 from langchain.vectorstores import Chroma
 from langchain.llms import Ollama
+import chromadb
 import os
 import argparse
 import time
@@ -22,7 +23,9 @@ def main():
    # Parse the command line arguments
    args = parse_arguments()
    embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
-    db = Chroma(persist_directory=persist_directory, embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
+
+    db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
+
    retriever = db.as_retriever(search_kwargs={"k": target_source_chunks})
    # activate/deactivate the streaming StdOut callback for LLMs
    callbacks = [] if args.mute_stream else [StreamingStdOutCallbackHandler()]
--- a/examples/langchain-python-rag-privategpt/requirements.txt
+++ b/examples/langchain-python-rag-privategpt/requirements.txt
--- a/examples/modelfile-sentiments/Modelfile
+++ b/examples/modelfile-sentiments/Modelfile
@@ -3,10 +3,8 @@

 FROM orca
 TEMPLATE """
-{{- if .First }}
 ### System:
 {{ .System }}
-{{- end }}
 ### User: 
 I hate it when my phone dies
 ### Response: 
--- a/examples/modelfile-sentiments/Readme.md
+++ b/examples/modelfile-sentiments/Readme.md
@@ -3,10 +3,8 @@
 This is a simple sentiments analyzer using the Orca model. When you pull Orca from the registry, it has a Template already defined that looks like this:

 ```Modelfile
-{{- if .First }}
 ### System:
 {{ .System }}
-{{- end }}

 ### User:
 {{ .Prompt }}
--- a/examples/python-simplegenerate/client.py
+++ b/examples/python-simplegenerate/client.py
@@ -17,7 +17,7 @@ def generate(prompt, context):
    for line in r.iter_lines():
        body = json.loads(line)
        response_part = body.get('response', '')
-        # the response streams one token at a time, print that as we recieve it
+        # the response streams one token at a time, print that as we receive it
        print(response_part, end='', flush=True)

        if 'error' in body:
@@ -35,4 +35,4 @@ def main():
        print()

 if __name__ == "__main__":
-    main()
+    main()
--- a/format/bytes.go
+++ b/format/bytes.go
@@ -12,11 +12,11 @@ const (
 func HumanBytes(b int64) string {
 	switch {
 	case b > GigaByte:
-		return fmt.Sprintf("%d GB", b/GigaByte)
+		return fmt.Sprintf("%.1f GB", float64(b)/GigaByte)
 	case b > MegaByte:
-		return fmt.Sprintf("%d MB", b/MegaByte)
+		return fmt.Sprintf("%.1f MB", float64(b)/MegaByte)
 	case b > KiloByte:
-		return fmt.Sprintf("%d KB", b/KiloByte)
+		return fmt.Sprintf("%.1f KB", float64(b)/KiloByte)
 	default:
 		return fmt.Sprintf("%d B", b)
 	}
--- a/format/format.go
+++ b/format/format.go
@@ -0,0 +1,25 @@
+package format
+
+import (
+	"fmt"
+	"math"
+)
+
+const (
+	Thousand = 1000
+	Million  = Thousand * 1000
+	Billion  = Million * 1000
+)
+
+func HumanNumber(b uint64) string {
+	switch {
+	case b > Billion:
+		return fmt.Sprintf("%.0fB", math.Round(float64(b)/Billion))
+	case b > Million:
+		return fmt.Sprintf("%.0fM", math.Round(float64(b)/Million))
+	case b > Thousand:
+		return fmt.Sprintf("%.0fK", math.Round(float64(b)/Thousand))
+	default:
+		return fmt.Sprintf("%d", b)
+	}
+}
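As a quick illustration of the thresholds in `HumanNumber` above, the sketch below copies the function into a standalone program and prints a few boundary cases. The sample inputs are hypothetical and were chosen only to show the strict `>` comparisons and the rounding; they are not taken from this change.

```go
package main

import (
	"fmt"
	"math"
)

const (
	Thousand = 1000
	Million  = Thousand * 1000
	Billion  = Million * 1000
)

// HumanNumber is copied from format/format.go as added in this diff.
func HumanNumber(b uint64) string {
	switch {
	case b > Billion:
		return fmt.Sprintf("%.0fB", math.Round(float64(b)/Billion))
	case b > Million:
		return fmt.Sprintf("%.0fM", math.Round(float64(b)/Million))
	case b > Thousand:
		return fmt.Sprintf("%.0fK", math.Round(float64(b)/Thousand))
	default:
		return fmt.Sprintf("%d", b)
	}
}

func main() {
	// Hypothetical inputs picked to exercise each branch and its boundary.
	for _, n := range []uint64{999, 1000, 1500, 999_999, 6_738_415_616} {
		fmt.Printf("%d -> %s\n", n, HumanNumber(n))
	}
}
```

Because the comparisons are strict, exactly 1000 falls through to the default branch and prints `1000`, while 999,999 rounds up to `1000K` rather than `1M`.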
--- a/go.mod
+++ b/go.mod
@@ -3,12 +3,11 @@ module github.com/jmorganca/ollama
 go 1.20

 require (
-	github.com/dustin/go-humanize v1.0.1
+	github.com/emirpasic/gods v1.18.1
 	github.com/gin-gonic/gin v1.9.1
 	github.com/mattn/go-runewidth v0.0.14
 	github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db
 	github.com/olekukonko/tablewriter v0.0.5
-	github.com/pdevine/readline v1.5.2
 	github.com/spf13/cobra v1.7.0
 	golang.org/x/sync v0.3.0
 )
@@ -39,12 +38,12 @@ require (
 	github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
 	github.com/ugorji/go/codec v1.2.11 // indirect
 	golang.org/x/arch v0.3.0 // indirect
-	golang.org/x/crypto v0.10.0
+	golang.org/x/crypto v0.14.0
 	golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63
-	golang.org/x/net v0.10.0 // indirect
-	golang.org/x/sys v0.11.0 // indirect
-	golang.org/x/term v0.10.0
-	golang.org/x/text v0.10.0 // indirect
+	golang.org/x/net v0.17.0 // indirect
+	golang.org/x/sys v0.13.0 // indirect
+	golang.org/x/term v0.13.0
+	golang.org/x/text v0.13.0 // indirect
 	google.golang.org/protobuf v1.30.0 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
 )
--- a/go.sum
+++ b/go.sum
@@ -4,17 +4,13 @@ github.com/bytedance/sonic v1.9.1/go.mod h1:i736AoUSYt75HyZLoJW9ERYxcy6eaN6h4BZX
 github.com/chenzhuoyu/base64x v0.0.0-20211019084208-fb5309c8db06/go.mod h1:DH46F32mSOjUmXrMHnKwZdA8wcEefY7UVqBKYGjpdQY=
 github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311 h1:qSGYFH7+jGhDF8vLC+iwCD4WpbV1EBDSzWkJODFLams=
 github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311/go.mod h1:b583jCggY9gE99b6G5LEC39OIiVsWj+R97kbl5odCEk=
-github.com/chzyer/logex v1.2.1 h1:XHDu3E6q+gdHgsdTPH6ImJMIp436vR6MPtH8gP05QzM=
-github.com/chzyer/logex v1.2.1/go.mod h1:JLbx6lG2kDbNRFnfkgvh4eRJRPX1QCoOIWomwysCBrQ=
-github.com/chzyer/test v1.0.0 h1:p3BQDXSxOhOG0P9z6/hGnII4LGiEPOYBhs8asl/fC04=
-github.com/chzyer/test v1.0.0/go.mod h1:2JlltgoNkt4TW/z9V/IzDdFaMTM2JPIi26O1pF38GC8=
 github.com/cpuguy83/go-md2man/v2 v2.0.2/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o=
 github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E=
 github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
-github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
-github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
+github.com/emirpasic/gods v1.18.1 h1:FXtiHYKDGKCW2KzwZKx0iC0PQmdlorYgdFG9jPXJ1Bc=
+github.com/emirpasic/gods v1.18.1/go.mod h1:8tpGGwCnJ5H4r6BWwaV6OrWmMoPhUl5jm/FMNAnJvWQ=
 github.com/gabriel-vasile/mimetype v1.4.2 h1:w5qFW6JKBz9Y393Y4q372O9A7cUSequkh1Q7OhCmWKU=
 github.com/gabriel-vasile/mimetype v1.4.2/go.mod h1:zApsH/mKG4w07erKIaJPFiX0Tsq9BFQgN3qGY5GnNgA=
 github.com/gin-contrib/cors v1.4.0 h1:oJ6gwtUl3lqV0WEIwM/LxPF1QZ5qe2lGWdY2+bz7y0g=
@@ -78,8 +74,6 @@ github.com/olekukonko/tablewriter v0.0.5 h1:P2Ga83D34wi1o9J6Wh1mRuqd4mF/x/lgBS7N
 github.com/olekukonko/tablewriter v0.0.5/go.mod h1:hPp6KlRPjbx+hW8ykQs1w3UBbZlj6HuIJcUGPhkA7kY=
 github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58 h1:onHthvaw9LFnH4t2DcNVpwGmV9E1BkGknEliJkfwQj0=
 github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58/go.mod h1:DXv8WO4yhMYhSNPKjeNKa5WY9YCIEBRbNzFFPJbWO6Y=
-github.com/pdevine/readline v1.5.2 h1:oz6Y5GdTmhPG+08hhxcAvtHitSANWuA2100Sppb38xI=
-github.com/pdevine/readline v1.5.2/go.mod h1:na/LbuE5PYwxI7GyopWdIs3U8HVe89lYlNTFTXH3wOw=
 github.com/pelletier/go-toml/v2 v2.0.1/go.mod h1:r9LEWfGN8R5k0VXJ+0BkIe7MYkRdwZOjgMj2KwnJFUo=
 github.com/pelletier/go-toml/v2 v2.0.8 h1:0ctb6s9mE31h0/lhu+J6OPmVeDxJn+kYnJc2jZR9tGQ=
 github.com/pelletier/go-toml/v2 v2.0.8/go.mod h1:vuYfssBdrU2XDZ9bYydBu6t+6a6PYNcZljzZR9VXg+4=
@@ -118,31 +112,30 @@ golang.org/x/arch v0.0.0-20210923205945-b76863e36670/go.mod h1:5om86z9Hs0C8fWVUu
 golang.org/x/arch v0.3.0 h1:02VY4/ZcO/gBOH6PUaoiptASxtXU10jazRCP865E97k=
 golang.org/x/arch v0.3.0/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8=
 golang.org/x/crypto v0.0.0-20210711020723-a769d52b0f97/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
-golang.org/x/crypto v0.10.0 h1:LKqV2xt9+kDzSTfOhx4FrkEBcMrAgHSYgzywV9zcGmM=
-golang.org/x/crypto v0.10.0/go.mod h1:o4eNf7Ede1fv+hwOwZsTHl9EsPFO6q6ZvYR8vYfY45I=
+golang.org/x/crypto v0.14.0 h1:wBqGXzWJW6m1XrIKlAH0Hs1JJ7+9KBwnIO8v66Q9cHc=
+golang.org/x/crypto v0.14.0/go.mod h1:MVFd36DqK4CsrnJYDkBA3VC4m2GkXAM0PvzMCn4JQf4=
 golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63 h1:m64FZMko/V45gv0bNmrNYoDEq8U5YUhetc9cBWKS1TQ=
 golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63/go.mod h1:0v4NqG35kSWCMzLaMeX+IQrlSnVE/bqGSyC2cz/9Le8=
 golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=
-golang.org/x/net v0.10.0 h1:X2//UzNDwYmtCLn7To6G58Wr6f5ahEAQgKNzv9Y951M=
-golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
+golang.org/x/net v0.17.0 h1:pVaXccu2ozPjCXewfr1S7xza/zcXTity9cCdXQYSjIM=
+golang.org/x/net v0.17.0/go.mod h1:NxSsAGuq816PNPmqtQdLE42eU2Fs7NoRIZrHJAlaCOE=
 golang.org/x/sync v0.3.0 h1:ftCYgMx6zT/asHUrPw8BLLscYtGznsLAnjq5RH9P66E=
 golang.org/x/sync v0.3.0/go.mod h1:FU7BRWz2tNW+3quACPkgCx/L+uEAv1htQ0V83Z9Rj+Y=
 golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
 golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.0.0-20210806184541-e5e7981a1069/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
-golang.org/x/sys v0.0.0-20220310020820-b874c991c1a5/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.0.0-20220704084225-05e143d24a9e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
-golang.org/x/sys v0.11.0 h1:eG7RXZHdqOJ1i+0lgLgCpSXAp6M3LYlAo6osgSi0xOM=
-golang.org/x/sys v0.11.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
+golang.org/x/sys v0.13.0 h1:Af8nKPmuFypiUBjVoU9V20FiaFXOcuZI21p0ycVYYGE=
+golang.org/x/sys v0.13.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
-golang.org/x/term v0.10.0 h1:3R7pNqamzBraeqj/Tj8qt1aQ2HpmlC+Cx/qL/7hn4/c=
-golang.org/x/term v0.10.0/go.mod h1:lpqdcUyK/oCiQxvxVrppt5ggO2KCZ5QblwqPnfZ6d5o=
+golang.org/x/term v0.13.0 h1:bb+I9cTfFazGW51MZqBVmZy7+JEJMouUHTUSKVQLBek=
+golang.org/x/term v0.13.0/go.mod h1:LTmsnFJwVN6bCy1rVCoS+qHT1HhALEFxKncY3WNNh4U=
 golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
 golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
-golang.org/x/text v0.10.0 h1:UpjohKhiEgNc0CSauXmwYftY1+LlaC75SJwh0SgCX58=
-golang.org/x/text v0.10.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
+golang.org/x/text v0.13.0 h1:ablQoSUd0tRdKxZewP80B+BaqeKJuVhuRxj/dkrun3k=
+golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
 golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
 golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
 google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
--- a/llm/ggml.go
+++ b/llm/ggml.go
@@ -175,7 +175,8 @@ const (
 	// Magic constant for `ggla` files (LoRA adapter).
 	FILE_MAGIC_GGLA = 0x67676C61
 	// Magic constant for `gguf` files (versioned, gguf)
-	FILE_MAGIC_GGUF = 0x46554747
+	FILE_MAGIC_GGUF_LE = 0x46554747
+	FILE_MAGIC_GGUF_BE = 0x47475546
 )

 func DecodeGGML(r io.ReadSeeker) (*GGML, error) {
@@ -191,8 +192,10 @@ func DecodeGGML(r io.ReadSeeker) (*GGML, error) {
 		ggml.container = &containerGGJT{}
 	case FILE_MAGIC_GGLA:
 		ggml.container = &containerLORA{}
-	case FILE_MAGIC_GGUF:
-		ggml.container = &containerGGUF{}
+	case FILE_MAGIC_GGUF_LE:
+		ggml.container = &containerGGUF{bo: binary.LittleEndian}
+	case FILE_MAGIC_GGUF_BE:
+		ggml.container = &containerGGUF{bo: binary.BigEndian}
 	default:
 		return nil, errors.New("invalid file magic")
 	}
--- a/llm/gguf.go
+++ b/llm/gguf.go
@@ -3,12 +3,15 @@ package llm
 import (
 	"bytes"
 	"encoding/binary"
-	"errors"
 	"fmt"
 	"io"
+
+	"github.com/jmorganca/ollama/format"
 )

 type containerGGUF struct {
+	bo binary.ByteOrder
+
 	Version uint32

 	V1 struct {
@@ -20,6 +23,8 @@ type containerGGUF struct {
 		NumTensor uint64
 		NumKV     uint64
 	}
+
+	parameters uint64
 }

 func (c *containerGGUF) Name() string {
@@ -27,15 +32,13 @@ func (c *containerGGUF) Name() string {
 }

 func (c *containerGGUF) Decode(r io.Reader) (model, error) {
-	binary.Read(r, binary.LittleEndian, &c.Version)
+	binary.Read(r, c.bo, &c.Version)

 	switch c.Version {
 	case 1:
-		binary.Read(r, binary.LittleEndian, &c.V1)
-	case 2:
-		binary.Read(r, binary.LittleEndian, &c.V2)
+		binary.Read(r, c.bo, &c.V1)
 	default:
-		return nil, errors.New("invalid version")
+		binary.Read(r, c.bo, &c.V2)
 	}

 	model := newGGUFModel(c)
@@ -76,6 +79,14 @@ func newGGUFModel(container *containerGGUF) *ggufModel {
 	}
 }

+func (llm *ggufModel) NumTensor() uint64 {
+	if llm.Version == 1 {
+		return uint64(llm.V1.NumTensor)
+	}
+
+	return llm.V2.NumTensor
+}
+
 func (llm *ggufModel) NumKV() uint64 {
 	if llm.Version == 1 {
 		return uint64(llm.V1.NumKV)
@@ -94,6 +105,10 @@ func (llm *ggufModel) ModelFamily() string {
 }

 func (llm *ggufModel) ModelType() string {
+	if llm.parameters > 0 {
+		return format.HumanNumber(llm.parameters)
+	}
+
 	switch llm.ModelFamily() {
 	case "llama":
 		if blocks, ok := llm.kv["llama.block_count"].(uint32); ok {
@@ -128,13 +143,9 @@ func (llm *ggufModel) FileType() string {
 }

 func (llm *ggufModel) Decode(r io.Reader) error {
-	read := llm.readString
-	if llm.Version == 1 {
-		read = llm.readStringV1
-	}
-
+	// decode key-values
 	for i := 0; uint64(i) < llm.NumKV(); i++ {
-		k, err := read(r)
+		k, err := llm.readString(r)
 		if err != nil {
 			return err
 		}
@@ -166,24 +177,14 @@ func (llm *ggufModel) Decode(r io.Reader) error {
 		case ggufTypeBool:
 			v = llm.readBool(r)
 		case ggufTypeString:
-			fn := llm.readString
-			if llm.Version == 1 {
-				fn = llm.readStringV1
-			}
-
-			s, err := fn(r)
+			s, err := llm.readString(r)
 			if err != nil {
 				return err
 			}

 			v = s
 		case ggufTypeArray:
-			fn := llm.readArray
-			if llm.Version == 1 {
-				fn = llm.readArrayV1
-			}
-
-			a, err := fn(r)
+			a, err := llm.readArray(r)
 			if err != nil {
 				return err
 			}
@@ -196,6 +197,25 @@ func (llm *ggufModel) Decode(r io.Reader) error {
 		llm.kv[k] = v
 	}

+	// decode tensors
+	for i := 0; uint64(i) < llm.NumTensor(); i++ {
+		if _, err := llm.readString(r); err != nil {
+			return err
+		}
+
+		dimensions := llm.readU32(r)
+
+		var elements uint64 = 1
+		for i := 0; uint32(i) < dimensions; i++ {
+			elements *= llm.readU64(r)
+		}
+
+		llm.readU32(r) // type
+		llm.readU64(r) // offset
+
+		llm.parameters += elements
+	}
+
 	return nil
 }

@@ -209,75 +229,75 @@ func (llm *ggufModel) NumLayers() int64 {
 	return int64(v)
 }

-func (ggufModel) readU8(r io.Reader) uint8 {
+func (llm ggufModel) readU8(r io.Reader) uint8 {
 	var u8 uint8
-	binary.Read(r, binary.LittleEndian, &u8)
+	binary.Read(r, llm.bo, &u8)
 	return u8
 }

-func (ggufModel) readI8(r io.Reader) int8 {
+func (llm ggufModel) readI8(r io.Reader) int8 {
 	var i8 int8
-	binary.Read(r, binary.LittleEndian, &i8)
+	binary.Read(r, llm.bo, &i8)
 	return i8
 }

-func (ggufModel) readU16(r io.Reader) uint16 {
+func (llm ggufModel) readU16(r io.Reader) uint16 {
 	var u16 uint16
-	binary.Read(r, binary.LittleEndian, &u16)
+	binary.Read(r, llm.bo, &u16)
 	return u16
 }

-func (ggufModel) readI16(r io.Reader) int16 {
+func (llm ggufModel) readI16(r io.Reader) int16 {
 	var i16 int16
-	binary.Read(r, binary.LittleEndian, &i16)
+	binary.Read(r, llm.bo, &i16)
 	return i16
 }

-func (ggufModel) readU32(r io.Reader) uint32 {
+func (llm ggufModel) readU32(r io.Reader) uint32 {
 	var u32 uint32
-	binary.Read(r, binary.LittleEndian, &u32)
+	binary.Read(r, llm.bo, &u32)
 	return u32
 }

-func (ggufModel) readI32(r io.Reader) int32 {
+func (llm ggufModel) readI32(r io.Reader) int32 {
 	var i32 int32
-	binary.Read(r, binary.LittleEndian, &i32)
+	binary.Read(r, llm.bo, &i32)
 	return i32
 }

-func (ggufModel) readU64(r io.Reader) uint64 {
+func (llm ggufModel) readU64(r io.Reader) uint64 {
 	var u64 uint64
-	binary.Read(r, binary.LittleEndian, &u64)
+	binary.Read(r, llm.bo, &u64)
 	return u64
 }

-func (ggufModel) readI64(r io.Reader) int64 {
+func (llm ggufModel) readI64(r io.Reader) int64 {
 	var i64 int64
-	binary.Read(r, binary.LittleEndian, &i64)
+	binary.Read(r, llm.bo, &i64)
 	return i64
 }

-func (ggufModel) readF32(r io.Reader) float32 {
+func (llm ggufModel) readF32(r io.Reader) float32 {
 	var f32 float32
-	binary.Read(r, binary.LittleEndian, &f32)
+	binary.Read(r, llm.bo, &f32)
 	return f32
 }

-func (ggufModel) readF64(r io.Reader) float64 {
+func (llm ggufModel) readF64(r io.Reader) float64 {
 	var f64 float64
-	binary.Read(r, binary.LittleEndian, &f64)
+	binary.Read(r, llm.bo, &f64)
 	return f64
 }

-func (ggufModel) readBool(r io.Reader) bool {
+func (llm ggufModel) readBool(r io.Reader) bool {
 	var b bool
-	binary.Read(r, binary.LittleEndian, &b)
+	binary.Read(r, llm.bo, &b)
 	return b
 }

-func (ggufModel) readStringV1(r io.Reader) (string, error) {
+func (llm ggufModel) readStringV1(r io.Reader) (string, error) {
 	var nameLength uint32
-	binary.Read(r, binary.LittleEndian, &nameLength)
+	binary.Read(r, llm.bo, &nameLength)

 	var b bytes.Buffer
 	if _, err := io.CopyN(&b, r, int64(nameLength)); err != nil {
@@ -291,8 +311,12 @@ func (ggufModel) readStringV1(r io.Reader) (string, error) {
 }

 func (llm ggufModel) readString(r io.Reader) (string, error) {
+	if llm.Version == 1 {
+		return llm.readStringV1(r)
+	}
+
 	var nameLength uint64
-	binary.Read(r, binary.LittleEndian, &nameLength)
+	binary.Read(r, llm.bo, &nameLength)

 	var b bytes.Buffer
 	if _, err := io.CopyN(&b, r, int64(nameLength)); err != nil {
@@ -340,6 +364,10 @@ func (llm *ggufModel) readArrayV1(r io.Reader) (arr []any, err error) {
 }

 func (llm *ggufModel) readArray(r io.Reader) (arr []any, err error) {
+	if llm.Version == 1 {
+		return llm.readArrayV1(r)
+	}
+
 	atype := llm.readU32(r)
 	n := llm.readU64(r)

--- a/llm/llama.cpp/generate_darwin_amd64.go
+++ b/llm/llama.cpp/generate_darwin_amd64.go
@@ -12,7 +12,8 @@ package llm
 //go:generate mv ggml/build/cpu/bin/server ggml/build/cpu/bin/ollama-runner

 //go:generate git submodule update --force gguf
-//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
+//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
+//go:generate git -C gguf apply ../patches/0001-metal-handle-ggml_scale-for-n-4-0-close-3754.patch
 //go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
 //go:generate cmake --build gguf/build/cpu --target server --config Release
 //go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner
--- a/llm/llama.cpp/generate_darwin_arm64.go
+++ b/llm/llama.cpp/generate_darwin_arm64.go
@@ -12,7 +12,8 @@ package llm
 //go:generate mv ggml/build/metal/bin/server ggml/build/metal/bin/ollama-runner

 //go:generate git submodule update --force gguf
-//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
+//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
+//go:generate git -C gguf apply ../patches/0001-metal-handle-ggml_scale-for-n-4-0-close-3754.patch
 //go:generate cmake -S gguf -B gguf/build/metal -DLLAMA_METAL=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
 //go:generate cmake --build gguf/build/metal --target server --config Release
 //go:generate mv gguf/build/metal/bin/server gguf/build/metal/bin/ollama-runner
--- a/llm/llama.cpp/generate_linux.go
+++ b/llm/llama.cpp/generate_linux.go
@@ -13,14 +13,14 @@ package llm

 //go:generate git submodule update --force gguf
 //go:generate git -C gguf apply ../patches/0001-copy-cuda-runtime-libraries.patch
-//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
-//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on
+//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
+//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
 //go:generate cmake --build gguf/build/cpu --target server --config Release
 //go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner

 //go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
 //go:generate cmake --build ggml/build/cuda --target server --config Release
 //go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
-//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
+//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
 //go:generate cmake --build gguf/build/cuda --target server --config Release
 //go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner
--- a/llm/llama.cpp/generate_windows.go
+++ b/llm/llama.cpp/generate_windows.go
@@ -10,7 +10,7 @@ package llm
 //go:generate cmd /c move ggml\build\cpu\bin\Release\server.exe ggml\build\cpu\bin\Release\ollama-runner.exe

 //go:generate git submodule update --force gguf
-//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
-//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on
+//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
+//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
 //go:generate cmake --build gguf/build/cpu --target server --config Release
 //go:generate cmd /c move gguf\build\cpu\bin\Release\server.exe gguf\build\cpu\bin\Release\ollama-runner.exe
--- a/llm/llama.cpp/gguf
+++ b/llm/llama.cpp/gguf
--- a/llm/llama.cpp/patches/0001-metal-handle-ggml_scale-for-n-4-0-close-3754.patch
+++ b/llm/llama.cpp/patches/0001-metal-handle-ggml_scale-for-n-4-0-close-3754.patch
@@ -0,0 +1,91 @@
+From 469c9addef75893e6be12edda852d12e840bf064 Mon Sep 17 00:00:00 2001
+From: Georgi Gerganov <ggerganov@gmail.com>
+Date: Tue, 24 Oct 2023 09:46:50 +0300
+Subject: [PATCH 1/2] metal : handle ggml_scale for n%4 != 0 (close #3754)
+
+ggml-ci
+---
+ ggml-metal.m     | 18 +++++++++++++-----
+ ggml-metal.metal | 10 +++++++++-
+ 2 files changed, 22 insertions(+), 6 deletions(-)
+
+diff --git a/ggml-metal.m b/ggml-metal.m
+index c908106..c1901dc 100644
+--- a/ggml-metal.m
+++ b/ggml-metal.m
+@@ -62,6 +62,7 @@
+     GGML_METAL_DECL_KERNEL(mul);
+     GGML_METAL_DECL_KERNEL(mul_row); // TODO: avoid this extra kernel, instead extend the "mul" kernel to support broadcast
+     GGML_METAL_DECL_KERNEL(scale);
++    GGML_METAL_DECL_KERNEL(scale_4);
+     GGML_METAL_DECL_KERNEL(silu);
+     GGML_METAL_DECL_KERNEL(relu);
+     GGML_METAL_DECL_KERNEL(gelu);
+@@ -249,6 +250,7 @@ static void ggml_metal_log(enum ggml_log_level level, const char* format, ...){
+         GGML_METAL_ADD_KERNEL(mul);
+         GGML_METAL_ADD_KERNEL(mul_row);
+         GGML_METAL_ADD_KERNEL(scale);
++        GGML_METAL_ADD_KERNEL(scale_4);
+         GGML_METAL_ADD_KERNEL(silu);
+         GGML_METAL_ADD_KERNEL(relu);
+         GGML_METAL_ADD_KERNEL(gelu);
+@@ -347,6 +349,7 @@ void ggml_metal_free(struct ggml_metal_context * ctx) {
+     GGML_METAL_DEL_KERNEL(mul);
+     GGML_METAL_DEL_KERNEL(mul_row);
+     GGML_METAL_DEL_KERNEL(scale);
++    GGML_METAL_DEL_KERNEL(scale_4);
+     GGML_METAL_DEL_KERNEL(silu);
+     GGML_METAL_DEL_KERNEL(relu);
+     GGML_METAL_DEL_KERNEL(gelu);
+@@ -923,15 +926,20 @@ void ggml_metal_graph_compute(
+ 
+                             const float scale = *(const float *) src1->data;
+ 
+-                            [encoder setComputePipelineState:ctx->pipeline_scale];
++                           int64_t n = ggml_nelements(dst);
++
++                           if (n % 4 == 0) {
++                               n /= 4;
++                               [encoder setComputePipelineState:ctx->pipeline_scale_4];
++                           } else {
++                               [encoder setComputePipelineState:ctx->pipeline_scale];
++                           }
++
+                             [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
+                             [encoder setBuffer:id_dst  offset:offs_dst  atIndex:1];
+                             [encoder setBytes:&scale length:sizeof(scale) atIndex:2];
+ 
+-                            const int64_t n = ggml_nelements(dst);
+-                            GGML_ASSERT(n % 4 == 0);
+-
+-                            [encoder dispatchThreadgroups:MTLSizeMake(n/4, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
++                           [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+                         } break;
+                     case GGML_OP_UNARY:
+                         switch (ggml_get_unary_op(gf->nodes[i])) {
+diff --git a/ggml-metal.metal b/ggml-metal.metal
+index 69fc713..f4b4605 100644
+--- a/ggml-metal.metal
+++ b/ggml-metal.metal
+@@ -125,9 +125,17 @@ kernel void kernel_mul_row(
+ }
+ 
+ kernel void kernel_scale(
++        device const float * src0,
++        device       float * dst,
++        constant     float & scale,
++        uint tpig[[thread_position_in_grid]]) {
++    dst[tpig] = src0[tpig] * scale;
++}
++
++kernel void kernel_scale_4(
+         device const float4 * src0,
+         device       float4 * dst,
+-        constant     float & scale,
++        constant     float  & scale,
+         uint tpig[[thread_position_in_grid]]) {
+     dst[tpig] = src0[tpig] * scale;
+ }
+-- 
+2.39.3 (Apple Git-145)
+
--- a/llm/llama.cpp/patches/0001-remove-warm-up-logging.patch
+++ b/llm/llama.cpp/patches/0001-remove-warm-up-logging.patch
@@ -1,25 +0,0 @@
-From 8dbb5449db259a9c24796e7927d89bee98b6c8f5 Mon Sep 17 00:00:00 2001
-From: Bruce MacDonald <brucewmacdonald@gmail.com>
-Date: Thu, 5 Oct 2023 11:21:12 -0400
-Subject: [PATCH] remove warm up logging
-
----
- common/common.cpp | 2 --
- 1 file changed, 2 deletions(-)
-
-diff --git a/common/common.cpp b/common/common.cpp
-index 7370017..c4433fe 100644
---- a/common/common.cpp
-+++ b/common/common.cpp
-@@ -839,8 +839,6 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
-     }
- 
-     {
--        LOG("warming up the model with an empty run\n");
--
-         std::vector<llama_token> tmp = { llama_token_bos(lctx), llama_token_eos(lctx), };
-         llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
-         llama_kv_cache_tokens_rm(lctx, -1, -1);
--- 
-2.39.2 (Apple Git-143)
-
--- a/llm/llama.cpp/patches/0001-update-default-log-target.patch
+++ b/llm/llama.cpp/patches/0001-update-default-log-target.patch
@@ -0,0 +1,25 @@
+From 6465fec6290f0a7f5d4d0fbe6bcf634e4810dde6 Mon Sep 17 00:00:00 2001
+From: Michael Yang <mxyng@pm.me>
+Date: Mon, 23 Oct 2023 10:39:34 -0700
+Subject: [PATCH] default log stderr
+
+---
+ common/log.h | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/common/log.h b/common/log.h
+index b8953fd..25522cd 100644
+--- a/common/log.h
++++ b/common/log.h
+@@ -90,7 +90,7 @@
+ //  }
+ //
+ #ifndef LOG_TARGET
+-    #define LOG_TARGET log_handler()
++    #define LOG_TARGET nullptr
+ #endif
+ 
+ #ifndef LOG_TEE_TARGET
+-- 
+2.42.0
+
--- a/llm/llama.go
+++ b/llm/llama.go
@@ -27,6 +27,34 @@ import (
 	"github.com/jmorganca/ollama/format"
 )

+const jsonGrammar = `
+root   ::= object
+value  ::= object | array | string | number | ("true" | "false" | "null") ws
+
+object ::=
+  "{" ws (
+            string ":" ws value
+    ("," ws string ":" ws value)*
+  )? "}" ws
+
+array  ::=
+  "[" ws (
+            value
+    ("," ws value)*
+  )? "]" ws
+
+string ::=
+  "\"" (
+    [^"\\] |
+    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
+  )* "\"" ws
+
+number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
+
+# Optional space: by convention, applied in this grammar after literal chars when allowed
+ws ::= ([ \t\n] ws)?
+`
+
 //go:embed llama.cpp/*/build/*/bin/*
 var llamaCppEmbed embed.FS

@@ -196,7 +224,10 @@ type llama struct {
 	Running
 }

-var errNoGPU = errors.New("nvidia-smi command failed")
+var (
+	errNvidiaSMI     = errors.New("nvidia-smi command failed")
+	errAvailableVRAM = errors.New("not enough VRAM available, falling back to CPU only")
+)

 // CheckVRAM returns the free VRAM in bytes on Linux machines with NVIDIA GPUs
 func CheckVRAM() (int64, error) {
@@ -205,13 +236,17 @@ func CheckVRAM() (int64, error) {
 	cmd.Stdout = &stdout
 	err := cmd.Run()
 	if err != nil {
-		return 0, errNoGPU
+		return 0, errNvidiaSMI
 	}

 	var freeMiB int64
 	scanner := bufio.NewScanner(&stdout)
 	for scanner.Scan() {
 		line := scanner.Text()
+		if strings.Contains(line, "[Insufficient Permissions]") {
+			return 0, fmt.Errorf("GPU support may not be enabled, check that you have installed GPU drivers and have the necessary permissions to run nvidia-smi")
+		}
+
 		vram, err := strconv.ParseInt(strings.TrimSpace(line), 10, 64)
 		if err != nil {
 			return 0, fmt.Errorf("failed to parse available VRAM: %v", err)
@@ -222,8 +257,8 @@ func CheckVRAM() (int64, error) {

 	freeBytes := freeMiB * 1024 * 1024
 	if freeBytes < 2*format.GigaByte {
-		log.Printf("less than 2 GB VRAM available, falling back to CPU only")
-		freeMiB = 0
+		log.Printf("less than 2 GB VRAM available")
+		return 0, errAvailableVRAM
 	}

 	return freeBytes, nil
@@ -236,19 +271,22 @@ func NumGPU(numLayer, fileSizeBytes int64, opts api.Options) int {
 	if runtime.GOOS == "linux" {
 		freeBytes, err := CheckVRAM()
 		if err != nil {
-			if err.Error() != "nvidia-smi command failed" {
+			if !errors.Is(err, errNvidiaSMI) {
 				log.Print(err.Error())
 			}
 			// nvidia driver not installed or no nvidia GPU found
 			return 0
 		}

-		// Calculate bytes per layer
-		// TODO: this is a rough heuristic, better would be to calculate this based on number of layers and context size
+		/*
+		 Calculate bytes per layer; this is roughly the size of the model file divided by the number of layers.
+		 Both the model weights and the kv cache can be stored in VRAM.
+		 To enable kv cache VRAM storage, add two additional layers to the number of layers retrieved from the model file.
+		*/
 		bytesPerLayer := fileSizeBytes / numLayer

-		// max number of layers we can fit in VRAM, subtract 8% to prevent consuming all available VRAM and running out of memory
-		layers := int(freeBytes/bytesPerLayer) * 92 / 100
+		// use 75% of the absolute max number of layers we can fit in available VRAM; off-loading too many layers to the GPU can cause OOM errors
+		layers := int(freeBytes/bytesPerLayer) * 3 / 4
 		log.Printf("%d MB VRAM available, loading up to %d GPU layers", freeBytes/(1024*1024), layers)

 		return layers
@@ -299,13 +337,19 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 	params := []string{
 		"--model", model,
 		"--ctx-size", fmt.Sprintf("%d", opts.NumCtx),
-		"--rope-freq-base", fmt.Sprintf("%f", opts.RopeFrequencyBase),
-		"--rope-freq-scale", fmt.Sprintf("%f", opts.RopeFrequencyScale),
 		"--batch-size", fmt.Sprintf("%d", opts.NumBatch),
 		"--n-gpu-layers", fmt.Sprintf("%d", numGPU),
 		"--embedding",
 	}

+	if opts.RopeFrequencyBase > 0 {
+		params = append(params, "--rope-freq-base", fmt.Sprintf("%f", opts.RopeFrequencyBase))
+	}
+
+	if opts.RopeFrequencyScale > 0 {
+		params = append(params, "--rope-freq-scale", fmt.Sprintf("%f", opts.RopeFrequencyScale))
+	}
+
 	if opts.NumGQA > 0 {
 		params = append(params, "--gqa", fmt.Sprintf("%d", opts.NumGQA))
 	}
@@ -353,7 +397,15 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 			runner.Path,
 			append(params, "--port", strconv.Itoa(port))...,
 		)
-		cmd.Env = append(os.Environ(), fmt.Sprintf("LD_LIBRARY_PATH=%s", filepath.Dir(runner.Path)))
+
+		var libraryPaths []string
+		if libraryPath, ok := os.LookupEnv("LD_LIBRARY_PATH"); ok {
+			libraryPaths = append(libraryPaths, libraryPath)
+		}
+
+		libraryPaths = append(libraryPaths, filepath.Dir(runner.Path))
+
+		cmd.Env = append(os.Environ(), fmt.Sprintf("LD_LIBRARY_PATH=%s", strings.Join(libraryPaths, ":")))
 		cmd.Stdout = os.Stderr
 		statusWriter := NewStatusWriter()
 		cmd.Stderr = statusWriter
@@ -473,7 +525,7 @@ type prediction struct {

 const maxBufferSize = 512 * format.KiloByte

-func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string, fn func(api.GenerateResponse)) error {
+func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string, format string, fn func(api.GenerateResponse)) error {
 	prevConvo, err := llm.Decode(ctx, prevContext)
 	if err != nil {
 		return err
@@ -508,6 +560,10 @@ func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string,
 		"stop":              llm.Stop,
 	}

+	if format == "json" {
+		request["grammar"] = jsonGrammar
+	}
+
 	// Handling JSON marshaling with special characters unescaped.
 	buffer := &bytes.Buffer{}
 	enc := json.NewEncoder(buffer)
--- a/llm/llm.go
+++ b/llm/llm.go
@@ -14,7 +14,7 @@ import (
 )

 type LLM interface {
-	Predict(context.Context, []int, string, func(api.GenerateResponse)) error
+	Predict(context.Context, []int, string, string, func(api.GenerateResponse)) error
 	Embedding(context.Context, string) ([]float64, error)
 	Encode(context.Context, string) ([]int, error)
 	Decode(context.Context, []int) (string, error)
@@ -85,7 +85,10 @@ func New(workDir, model string, adapters []string, opts api.Options) (LLM, error

 	switch ggml.Name() {
 	case "gguf":
-		opts.NumGQA = 0 // TODO: remove this when llama.cpp runners differ enough to need separate newLlama functions
+		// TODO: gguf will load these options automatically from the model binary
+		opts.NumGQA = 0
+		opts.RopeFrequencyBase = 0.0
+		opts.RopeFrequencyScale = 0.0
 		return newLlama(model, adapters, chooseRunners(workDir, "gguf"), ggml.NumLayers(), opts)
 	case "ggml", "ggmf", "ggjt", "ggla":
 		return newLlama(model, adapters, chooseRunners(workDir, "ggml"), ggml.NumLayers(), opts)
--- a/progressbar/progressbar.go
+++ b/progressbar/progressbar.go
@@ -291,7 +291,7 @@ func OptionShowDescriptionAtLineEnd() Option {
 	}
 }

-var defaultTheme = Theme{Saucer: "█", SaucerPadding: " ", BarStart: "|", BarEnd: "|"}
+var defaultTheme = Theme{Saucer: "█", SaucerPadding: " ", BarStart: "▕", BarEnd: "▏"}

 // NewOptions constructs a new instance of ProgressBar, with any options you specify
 func NewOptions(max int, options ...Option) *ProgressBar {
--- a/readline/buffer.go
+++ b/readline/buffer.go
@@ -0,0 +1,372 @@
+package readline
+
+import (
+	"fmt"
+	"os"
+
+	"github.com/emirpasic/gods/lists/arraylist"
+	"golang.org/x/term"
+)
+
+type Buffer struct {
+	Pos       int
+	Buf       *arraylist.List
+	Prompt    *Prompt
+	LineWidth int
+	Width     int
+	Height    int
+}
+
+func NewBuffer(prompt *Prompt) (*Buffer, error) {
+	fd := int(os.Stdout.Fd())
+	width, height, err := term.GetSize(fd)
+	if err != nil {
+		fmt.Println("Error getting size:", err)
+		return nil, err
+	}
+
+	lwidth := width - len(prompt.Prompt)
+	if prompt.UseAlt {
+		lwidth = width - len(prompt.AltPrompt)
+	}
+
+	b := &Buffer{
+		Pos:       0,
+		Buf:       arraylist.New(),
+		Prompt:    prompt,
+		Width:     width,
+		Height:    height,
+		LineWidth: lwidth,
+	}
+
+	return b, nil
+}
+
+func (b *Buffer) MoveLeft() {
+	if b.Pos > 0 {
+		if b.Pos%b.LineWidth == 0 {
+			fmt.Printf(CursorUp + CursorBOL + cursorRightN(b.Width))
+		} else {
+			fmt.Print(CursorLeft)
+		}
+		b.Pos -= 1
+	}
+}
+
+func (b *Buffer) MoveLeftWord() {
+	if b.Pos > 0 {
+		var foundNonspace bool
+		for {
+			v, _ := b.Buf.Get(b.Pos - 1)
+			if v == ' ' {
+				if foundNonspace {
+					break
+				}
+			} else {
+				foundNonspace = true
+			}
+			b.MoveLeft()
+
+			if b.Pos == 0 {
+				break
+			}
+		}
+	}
+}
+
+func (b *Buffer) MoveRight() {
+	if b.Pos < b.Size() {
+		b.Pos += 1
+		if b.Pos%b.LineWidth == 0 {
+			fmt.Printf(CursorDown + CursorBOL + cursorRightN(b.PromptSize()))
+		} else {
+			fmt.Print(CursorRight)
+		}
+	}
+}
+
+func (b *Buffer) MoveRightWord() {
+	if b.Pos < b.Size() {
+		for {
+			b.MoveRight()
+			v, _ := b.Buf.Get(b.Pos)
+			if v == ' ' {
+				break
+			}
+
+			if b.Pos == b.Size() {
+				break
+			}
+		}
+	}
+}
+
+func (b *Buffer) MoveToStart() {
+	if b.Pos > 0 {
+		currLine := b.Pos / b.LineWidth
+		if currLine > 0 {
+			for cnt := 0; cnt < currLine; cnt++ {
+				fmt.Print(CursorUp)
+			}
+		}
+		fmt.Printf(CursorBOL + cursorRightN(b.PromptSize()))
+		b.Pos = 0
+	}
+}
+
+func (b *Buffer) MoveToEnd() {
+	if b.Pos < b.Size() {
+		currLine := b.Pos / b.LineWidth
+		totalLines := b.Size() / b.LineWidth
+		if currLine < totalLines {
+			for cnt := 0; cnt < totalLines-currLine; cnt++ {
+				fmt.Print(CursorDown)
+			}
+			remainder := b.Size() % b.LineWidth
+			fmt.Printf(CursorBOL + cursorRightN(b.PromptSize()+remainder))
+		} else {
+			fmt.Print(cursorRightN(b.Size() - b.Pos))
+		}
+
+		b.Pos = b.Size()
+	}
+}
+
+func (b *Buffer) Size() int {
+	return b.Buf.Size()
+}
+
+func min(n, m int) int {
+	if n > m {
+		return m
+	}
+	return n
+}
+
+func (b *Buffer) PromptSize() int {
+	if b.Prompt.UseAlt {
+		return len(b.Prompt.AltPrompt)
+	}
+	return len(b.Prompt.Prompt)
+}
+
+func (b *Buffer) Add(r rune) {
+	if b.Pos == b.Buf.Size() {
+		fmt.Printf("%c", r)
+		b.Buf.Add(r)
+		b.Pos += 1
+		if b.Pos > 0 && b.Pos%b.LineWidth == 0 {
+			fmt.Printf("\n%s", b.Prompt.AltPrompt)
+		}
+	} else {
+		fmt.Printf("%c", r)
+		b.Buf.Insert(b.Pos, r)
+		b.Pos += 1
+		if b.Pos > 0 && b.Pos%b.LineWidth == 0 {
+			fmt.Printf("\n%s", b.Prompt.AltPrompt)
+		}
+		b.drawRemaining()
+	}
+}
+
+func (b *Buffer) drawRemaining() {
+	var place int
+	remainingText := b.StringN(b.Pos)
+	if b.Pos > 0 {
+		place = b.Pos % b.LineWidth
+	}
+	fmt.Print(CursorHide)
+
+	// render the rest of the current line
+	currLine := remainingText[:min(b.LineWidth-place, len(remainingText))]
+	if len(currLine) > 0 {
+		fmt.Printf(ClearToEOL + currLine)
+		fmt.Print(cursorLeftN(len(currLine)))
+	} else {
+		fmt.Print(ClearToEOL)
+	}
+
+	// render the other lines
+	if len(remainingText) > len(currLine) {
+		remaining := []rune(remainingText[len(currLine):])
+		var totalLines int
+		for i, c := range remaining {
+			if i%b.LineWidth == 0 {
+				fmt.Printf("\n%s", b.Prompt.AltPrompt)
+				totalLines += 1
+			}
+			fmt.Printf("%c", c)
+		}
+		fmt.Print(ClearToEOL)
+		fmt.Print(cursorUpN(totalLines))
+		fmt.Printf(CursorBOL + cursorRightN(b.Width-len(currLine)))
+	}
+
+	fmt.Print(CursorShow)
+}
+
+func (b *Buffer) Remove() {
+	if b.Buf.Size() > 0 && b.Pos > 0 {
+		if b.Pos%b.LineWidth == 0 {
+			// if the user backspaces over a line-wrap boundary, clear the current line
+			// and move to the end of the previous line
+			fmt.Printf(CursorBOL + ClearToEOL)
+			fmt.Printf(CursorUp + CursorBOL + cursorRightN(b.Width) + " " + CursorLeft)
+		} else {
+			fmt.Printf(CursorLeft + " " + CursorLeft)
+		}
+
+		var eraseExtraLine bool
+		if (b.Size()-1)%b.LineWidth == 0 {
+			eraseExtraLine = true
+		}
+
+		b.Pos -= 1
+		b.Buf.Remove(b.Pos)
+
+		if b.Pos < b.Size() {
+			b.drawRemaining()
+			// this erases a line which is left over when backspacing in the middle of a line and there
+			// are trailing characters which go over the line width boundary
+			if eraseExtraLine {
+				remainingLines := (b.Size() - b.Pos) / b.LineWidth
+				fmt.Printf(cursorDownN(remainingLines+1) + CursorBOL + ClearToEOL)
+				place := b.Pos % b.LineWidth
+				fmt.Printf(cursorUpN(remainingLines+1) + cursorRightN(place+len(b.Prompt.Prompt)))
+			}
+		}
+	}
+}
+
+func (b *Buffer) Delete() {
+	if b.Size() > 0 && b.Pos < b.Size() {
+		b.Buf.Remove(b.Pos)
+		b.drawRemaining()
+		if b.Size()%b.LineWidth == 0 {
+			if b.Pos != b.Size() {
+				remainingLines := (b.Size() - b.Pos) / b.LineWidth
+				fmt.Printf(cursorDownN(remainingLines) + CursorBOL + ClearToEOL)
+				place := b.Pos % b.LineWidth
+				fmt.Printf(cursorUpN(remainingLines) + cursorRightN(place+len(b.Prompt.Prompt)))
+			}
+		}
+	}
+}
+
+func (b *Buffer) DeleteBefore() {
+	if b.Pos > 0 {
+		for cnt := b.Pos - 1; cnt >= 0; cnt-- {
+			b.Remove()
+		}
+	}
+}
+
+func (b *Buffer) DeleteRemaining() {
+	if b.Size() > 0 && b.Pos < b.Size() {
+		charsToDel := b.Size() - b.Pos
+		for cnt := 0; cnt < charsToDel; cnt++ {
+			b.Delete()
+		}
+	}
+}
+
+func (b *Buffer) DeleteWord() {
+	if b.Buf.Size() > 0 && b.Pos > 0 {
+		var foundNonspace bool
+		for {
+			v, _ := b.Buf.Get(b.Pos - 1)
+			if v == ' ' {
+				if !foundNonspace {
+					b.Remove()
+				} else {
+					break
+				}
+			} else {
+				foundNonspace = true
+				b.Remove()
+			}
+
+			if b.Pos == 0 {
+				break
+			}
+		}
+	}
+}
+
+func (b *Buffer) ClearScreen() {
+	fmt.Printf(ClearScreen + CursorReset + b.Prompt.Prompt)
+	if b.IsEmpty() {
+		ph := b.Prompt.Placeholder
+		fmt.Printf(ColorGrey + ph + cursorLeftN(len(ph)) + ColorDefault)
+	} else {
+		currPos := b.Pos
+		b.Pos = 0
+		b.drawRemaining()
+		fmt.Printf(CursorReset + cursorRightN(len(b.Prompt.Prompt)))
+		if currPos > 0 {
+			targetLine := currPos / b.LineWidth
+			if targetLine > 0 {
+				for cnt := 0; cnt < targetLine; cnt++ {
+					fmt.Print(CursorDown)
+				}
+			}
+			remainder := currPos % b.LineWidth
+			if remainder > 0 {
+				fmt.Print(cursorRightN(remainder))
+			}
+			if currPos%b.LineWidth == 0 {
+				fmt.Printf(CursorBOL + b.Prompt.AltPrompt)
+			}
+		}
+		b.Pos = currPos
+	}
+}
+
+func (b *Buffer) IsEmpty() bool {
+	return b.Buf.Empty()
+}
+
+func (b *Buffer) Replace(r []rune) {
+	b.Pos = 0
+	b.Buf.Clear()
+	fmt.Printf(ClearLine + CursorBOL + b.Prompt.Prompt)
+	for _, c := range r {
+		b.Add(c)
+	}
+}
+
+func (b *Buffer) String() string {
+	return b.StringN(0)
+}
+
+func (b *Buffer) StringN(n int) string {
+	return b.StringNM(n, 0)
+}
+
+func (b *Buffer) StringNM(n, m int) string {
+	var s string
+	if m == 0 {
+		m = b.Size()
+	}
+	for cnt := n; cnt < m; cnt++ {
+		c, _ := b.Buf.Get(cnt)
+		s += string(c.(rune))
+	}
+	return s
+}
+
+func cursorLeftN(n int) string {
+	return fmt.Sprintf(CursorLeftN, n)
+}
+
+func cursorRightN(n int) string {
+	return fmt.Sprintf(CursorRightN, n)
+}
+
+func cursorUpN(n int) string {
+	return fmt.Sprintf(CursorUpN, n)
+}
+
+func cursorDownN(n int) string {
+	return fmt.Sprintf(CursorDownN, n)
+}
--- a/readline/errors.go
+++ b/readline/errors.go
@@ -0,0 +1,17 @@
+package readline
+
+import (
+	"errors"
+)
+
+var (
+	ErrInterrupt = errors.New("Interrupt")
+)
+
+type InterruptError struct {
+	Line []rune
+}
+
+func (*InterruptError) Error() string {
+	return "Interrupted"
+}
--- a/readline/history.go
+++ b/readline/history.go
@@ -0,0 +1,152 @@
+package readline
+
+import (
+	"bufio"
+	"errors"
+	"io"
+	"os"
+	"path/filepath"
+	"strings"
+
+	"github.com/emirpasic/gods/lists/arraylist"
+)
+
+type History struct {
+	Buf      *arraylist.List
+	Autosave bool
+	Pos      int
+	Limit    int
+	Filename string
+	Enabled  bool
+}
+
+func NewHistory() (*History, error) {
+	h := &History{
+		Buf:      arraylist.New(),
+		Limit:    100, //resizeme
+		Autosave: true,
+		Enabled:  true,
+	}
+
+	err := h.Init()
+	if err != nil {
+		return nil, err
+	}
+
+	return h, nil
+}
+
+func (h *History) Init() error {
+	home, err := os.UserHomeDir()
+	if err != nil {
+		return err
+	}
+
+	path := filepath.Join(home, ".ollama", "history")
+	h.Filename = path
+
+	//todo check if the file exists
+	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDONLY, 0600)
+	if err != nil {
+		if errors.Is(err, os.ErrNotExist) {
+			return nil
+		}
+		return err
+	}
+	defer f.Close()
+
+	r := bufio.NewReader(f)
+	for {
+		line, err := r.ReadString('\n')
+		if err != nil {
+			if err == io.EOF {
+				break
+			}
+			return err
+		}
+
+		line = strings.TrimSpace(line)
+		if len(line) == 0 {
+			continue
+		}
+
+		h.Add([]rune(line))
+	}
+
+	return nil
+}
+
+func (h *History) Add(l []rune) {
+	h.Buf.Add(l)
+	h.Compact()
+	h.Pos = h.Size()
+	if h.Autosave {
+		h.Save()
+	}
+}
+
+func (h *History) Compact() {
+	s := h.Buf.Size()
+	if s > h.Limit {
+		for cnt := 0; cnt < s-h.Limit; cnt++ {
+			h.Buf.Remove(0)
+		}
+	}
+}
+
+func (h *History) Clear() {
+	h.Buf.Clear()
+}
+
+func (h *History) Prev() []rune {
+	var line []rune
+	if h.Pos > 0 {
+		h.Pos -= 1
+	}
+	v, _ := h.Buf.Get(h.Pos)
+	line, _ = v.([]rune)
+	return line
+}
+
+func (h *History) Next() []rune {
+	var line []rune
+	if h.Pos < h.Buf.Size() {
+		h.Pos += 1
+		v, _ := h.Buf.Get(h.Pos)
+		line, _ = v.([]rune)
+	}
+	return line
+}
+
+func (h *History) Size() int {
+	return h.Buf.Size()
+}
+
+func (h *History) Save() error {
+	if !h.Enabled {
+		return nil
+	}
+
+	tmpFile := h.Filename + ".tmp"
+
+	f, err := os.OpenFile(tmpFile, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
+	if err != nil {
+		return err
+	}
+	defer f.Close()
+
+	buf := bufio.NewWriter(f)
+	for cnt := 0; cnt < h.Size(); cnt++ {
+		v, _ := h.Buf.Get(cnt)
+		line, _ := v.([]rune)
+		buf.WriteString(string(line) + "\n")
+	}
+	buf.Flush()
+	f.Close()
+
+	if err = os.Rename(tmpFile, h.Filename); err != nil {
+		return err
+	}
+
+	return nil
+}
--- a/readline/readline.go
+++ b/readline/readline.go
@@ -0,0 +1,254 @@
+package readline
+
+import (
+	"bufio"
+	"fmt"
+	"io"
+	"os"
+	"syscall"
+)
+
+type Prompt struct {
+	Prompt         string
+	AltPrompt      string
+	Placeholder    string
+	AltPlaceholder string
+	UseAlt         bool
+}
+
+type Terminal struct {
+	outchan chan rune
+}
+
+type Instance struct {
+	Prompt   *Prompt
+	Terminal *Terminal
+	History  *History
+}
+
+func New(prompt Prompt) (*Instance, error) {
+	term, err := NewTerminal()
+	if err != nil {
+		return nil, err
+	}
+
+	history, err := NewHistory()
+	if err != nil {
+		return nil, err
+	}
+
+	return &Instance{
+		Prompt:   &prompt,
+		Terminal: term,
+		History:  history,
+	}, nil
+}
+
+func (i *Instance) Readline() (string, error) {
+	prompt := i.Prompt.Prompt
+	if i.Prompt.UseAlt {
+		prompt = i.Prompt.AltPrompt
+	}
+	fmt.Print(prompt)
+
+	fd := int(syscall.Stdin)
+	termios, err := SetRawMode(fd)
+	if err != nil {
+		return "", err
+	}
+	defer UnsetRawMode(fd, termios)
+
+	buf, _ := NewBuffer(i.Prompt)
+
+	var esc bool
+	var escex bool
+	var metaDel bool
+	var pasteMode PasteMode
+
+	var currentLineBuf []rune
+
+	for {
+		if buf.IsEmpty() {
+			ph := i.Prompt.Placeholder
+			if i.Prompt.UseAlt {
+				ph = i.Prompt.AltPlaceholder
+			}
+			fmt.Printf(ColorGrey + ph + fmt.Sprintf(CursorLeftN, len(ph)) + ColorDefault)
+		}
+
+		r, err := i.Terminal.Read()
+
+		if buf.IsEmpty() {
+			fmt.Print(ClearToEOL)
+		}
+
+		if err != nil {
+			return "", io.EOF
+		}
+
+		if escex {
+			escex = false
+
+			switch r {
+			case KeyUp:
+				if i.History.Pos > 0 {
+					if i.History.Pos == i.History.Size() {
+						currentLineBuf = []rune(buf.String())
+					}
+					buf.Replace(i.History.Prev())
+				}
+			case KeyDown:
+				if i.History.Pos < i.History.Size() {
+					buf.Replace(i.History.Next())
+					if i.History.Pos == i.History.Size() {
+						buf.Replace(currentLineBuf)
+					}
+				}
+			case KeyLeft:
+				buf.MoveLeft()
+			case KeyRight:
+				buf.MoveRight()
+			case CharBracketedPaste:
+				var code string
+				for cnt := 0; cnt < 3; cnt++ {
+					r, err = i.Terminal.Read()
+					if err != nil {
+						return "", io.EOF
+					}
+
+					code += string(r)
+				}
+				if code == CharBracketedPasteStart {
+					pasteMode = PasteModeStart
+				} else if code == CharBracketedPasteEnd {
+					pasteMode = PasteModeEnd
+				}
+			case KeyDel:
+				if buf.Size() > 0 {
+					buf.Delete()
+				}
+				metaDel = true
+			case MetaStart:
+				buf.MoveToStart()
+			case MetaEnd:
+				buf.MoveToEnd()
+			default:
+				// skip any keys we don't know about
+				continue
+			}
+			continue
+		} else if esc {
+			esc = false
+
+			switch r {
+			case 'b':
+				buf.MoveLeftWord()
+			case 'f':
+				buf.MoveRightWord()
+			case CharEscapeEx:
+				escex = true
+			}
+			continue
+		}
+
+		switch r {
+		case CharNull:
+			continue
+		case CharEsc:
+			esc = true
+		case CharInterrupt:
+			return "", ErrInterrupt
+		case CharLineStart:
+			buf.MoveToStart()
+		case CharLineEnd:
+			buf.MoveToEnd()
+		case CharBackward:
+			buf.MoveLeft()
+		case CharForward:
+			buf.MoveRight()
+		case CharBackspace, CharCtrlH:
+			buf.Remove()
+		case CharTab:
+			// todo: convert back to real tabs
+			for cnt := 0; cnt < 8; cnt++ {
+				buf.Add(' ')
+			}
+		case CharDelete:
+			if buf.Size() > 0 {
+				buf.Delete()
+			} else {
+				return "", io.EOF
+			}
+		case CharKill:
+			buf.DeleteRemaining()
+		case CharCtrlU:
+			buf.DeleteBefore()
+		case CharCtrlL:
+			buf.ClearScreen()
+		case CharCtrlW:
+			buf.DeleteWord()
+		case CharEnter:
+			output := buf.String()
+			if output != "" {
+				i.History.Add([]rune(output))
+			}
+			buf.MoveToEnd()
+			fmt.Println()
+			switch pasteMode {
+			case PasteModeStart:
+				output = `"""` + output
+			case PasteModeEnd:
+				output = output + `"""`
+			}
+			return output, nil
+		default:
+			if metaDel {
+				metaDel = false
+				continue
+			}
+			if r >= CharSpace || r == CharEnter {
+				buf.Add(r)
+			}
+		}
+	}
+}
+
+func (i *Instance) HistoryEnable() {
+	i.History.Enabled = true
+}
+
+func (i *Instance) HistoryDisable() {
+	i.History.Enabled = false
+}
+
+func NewTerminal() (*Terminal, error) {
+	t := &Terminal{
+		outchan: make(chan rune),
+	}
+
+	go t.ioloop()
+
+	return t, nil
+}
+
+func (t *Terminal) ioloop() {
+	buf := bufio.NewReader(os.Stdin)
+
+	for {
+		r, _, err := buf.ReadRune()
+		if err != nil {
+			close(t.outchan)
+			break
+		}
+		t.outchan <- r
+	}
+}
+
+func (t *Terminal) Read() (rune, error) {
+	r, ok := <-t.outchan
+	if !ok {
+		return 0, io.EOF
+	}
+
+	return r, nil
+}
--- a/readline/term.go
+++ b/readline/term.go
@@ -0,0 +1,36 @@
+//go:build aix || darwin || dragonfly || freebsd || (linux && !appengine) || netbsd || openbsd || os400 || solaris
+
+package readline
+
+import (
+	"syscall"
+)
+
+type Termios syscall.Termios
+
+func SetRawMode(fd int) (*Termios, error) {
+	termios, err := getTermios(fd)
+	if err != nil {
+		return nil, err
+	}
+
+	newTermios := *termios
+	newTermios.Iflag &^= syscall.IGNBRK | syscall.BRKINT | syscall.PARMRK | syscall.ISTRIP | syscall.INLCR | syscall.IGNCR | syscall.ICRNL | syscall.IXON
+	newTermios.Lflag &^= syscall.ECHO | syscall.ECHONL | syscall.ICANON | syscall.ISIG | syscall.IEXTEN
+	newTermios.Cflag &^= syscall.CSIZE | syscall.PARENB
+	newTermios.Cflag |= syscall.CS8
+	newTermios.Cc[syscall.VMIN] = 1
+	newTermios.Cc[syscall.VTIME] = 0
+
+	return termios, setTermios(fd, &newTermios)
+}
+
+func UnsetRawMode(fd int, termios *Termios) error {
+	return setTermios(fd, termios)
+}
+
+// IsTerminal returns true if the given file descriptor is a terminal.
+func IsTerminal(fd int) bool {
+	_, err := getTermios(fd)
+	return err == nil
+}
--- a/readline/term_bsd.go
+++ b/readline/term_bsd.go
@@ -0,0 +1,25 @@
+//go:build darwin || freebsd || netbsd || openbsd
+
+package readline
+
+import (
+	"syscall"
+	"unsafe"
+)
+
+func getTermios(fd int) (*Termios, error) {
+	termios := new(Termios)
+	_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), syscall.TIOCGETA, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
+	if err != 0 {
+		return nil, err
+	}
+	return termios, nil
+}
+
+func setTermios(fd int, termios *Termios) error {
+	_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), syscall.TIOCSETA, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
+	if err != 0 {
+		return err
+	}
+	return nil
+}
--- a/readline/term_linux.go
+++ b/readline/term_linux.go
@@ -0,0 +1,28 @@
+//go:build linux || solaris
+
+package readline
+
+import (
+	"syscall"
+	"unsafe"
+)
+
+const tcgets = 0x5401
+const tcsets = 0x5402
+
+func getTermios(fd int) (*Termios, error) {
+	termios := new(Termios)
+	_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), tcgets, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
+	if err != 0 {
+		return nil, err
+	}
+	return termios, nil
+}
+
+func setTermios(fd int, termios *Termios) error {
+	_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), tcsets, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
+	if err != 0 {
+		return err
+	}
+	return nil
+}
--- a/readline/term_windows.go
+++ b/readline/term_windows.go
@@ -0,0 +1,62 @@
+package readline
+
+import (
+	"syscall"
+	"unsafe"
+)
+
+const (
+	enableLineInput       = 2
+	enableWindowInput     = 8
+	enableMouseInput      = 16
+	enableInsertMode      = 32
+	enableQuickEditMode   = 64
+	enableExtendedFlags   = 128
+	enableProcessedOutput = 1
+	enableWrapAtEolOutput = 2
+	enableAutoPosition    = 256 // Cursor position is not affected by writing data to the console.
+	enableEchoInput       = 4   // Characters are written to the console as they're read.
+	enableProcessedInput  = 1   // Enables input processing (like recognizing Ctrl+C).
+)
+
+var kernel32 = syscall.NewLazyDLL("kernel32.dll")
+
+var (
+	procGetConsoleMode = kernel32.NewProc("GetConsoleMode")
+	procSetConsoleMode = kernel32.NewProc("SetConsoleMode")
+)
+
+type State struct {
+	mode uint32
+}
+
+// IsTerminal checks if the given file descriptor is associated with a terminal
+func IsTerminal(fd int) bool {
+	var st uint32
+	r, _, e := syscall.SyscallN(procGetConsoleMode.Addr(), uintptr(fd), uintptr(unsafe.Pointer(&st)), 0)
+	// if the call succeeds and doesn't produce an error, it's a terminal
+	return r != 0 && e == 0
+}
+
+func SetRawMode(fd int) (*State, error) {
+	var st uint32
+	// retrieve the current mode of the terminal
+	_, _, e := syscall.SyscallN(procGetConsoleMode.Addr(), uintptr(fd), uintptr(unsafe.Pointer(&st)), 0)
+	if e != 0 {
+		return nil, error(e)
+	}
+	// modify the mode to set it to raw
+	raw := st &^ (enableEchoInput | enableProcessedInput | enableLineInput | enableProcessedOutput)
+	// apply the new mode to the terminal
+	_, _, e = syscall.SyscallN(procSetConsoleMode.Addr(), uintptr(fd), uintptr(raw), 0)
+	if e != 0 {
+		return nil, error(e)
+	}
+	// return the original state so that it can be restored later
+	return &State{st}, nil
+}
+
+func UnsetRawMode(fd int, state *State) error {
+	_, _, err := syscall.SyscallN(procSetConsoleMode.Addr(), uintptr(fd), uintptr(state.mode), 0)
+	return err
+}
--- a/readline/types.go
+++ b/readline/types.go
@@ -0,0 +1,86 @@
+package readline
+
+const (
+	CharNull      = 0
+	CharLineStart = 1
+	CharBackward  = 2
+	CharInterrupt = 3
+	CharDelete    = 4
+	CharLineEnd   = 5
+	CharForward   = 6
+	CharBell      = 7
+	CharCtrlH     = 8
+	CharTab       = 9
+	CharCtrlJ     = 10
+	CharKill      = 11
+	CharCtrlL     = 12
+	CharEnter     = 13
+	CharNext      = 14
+	CharPrev      = 16
+	CharBckSearch = 18
+	CharFwdSearch = 19
+	CharTranspose = 20
+	CharCtrlU     = 21
+	CharCtrlW     = 23
+	CharCtrlY     = 25
+	CharCtrlZ     = 26
+	CharEsc       = 27
+	CharSpace     = 32
+	CharEscapeEx  = 91
+	CharBackspace = 127
+)
+
+const (
+	KeyDel    = 51
+	KeyUp     = 65
+	KeyDown   = 66
+	KeyRight  = 67
+	KeyLeft   = 68
+	MetaEnd   = 70
+	MetaStart = 72
+)
+
+const (
+	CursorUp    = "\033[1A"
+	CursorDown  = "\033[1B"
+	CursorRight = "\033[1C"
+	CursorLeft  = "\033[1D"
+
+	CursorSave    = "\033[s"
+	CursorRestore = "\033[u"
+
+	CursorUpN    = "\033[%dA"
+	CursorDownN  = "\033[%dB"
+	CursorRightN = "\033[%dC"
+	CursorLeftN  = "\033[%dD"
+
+	CursorEOL  = "\033[E"
+	CursorBOL  = "\033[1G"
+	CursorHide = "\033[?25l"
+	CursorShow = "\033[?25h"
+
+	ClearToEOL  = "\033[K"
+	ClearLine   = "\033[2K"
+	ClearScreen = "\033[2J"
+	CursorReset = "\033[0;0f"
+
+	ColorGrey    = "\033[38;5;245m"
+	ColorDefault = "\033[0m"
+
+	StartBracketedPaste = "\033[?2004h"
+	EndBracketedPaste   = "\033[?2004l"
+)
+
+const (
+	CharBracketedPaste      = 50
+	CharBracketedPasteStart = "00~"
+	CharBracketedPasteEnd   = "01~"
+)
+
+type PasteMode int
+
+const (
+	PasteModeOff = iota
+	PasteModeStart
+	PasteModeEnd
+)
--- a/scripts/install.sh
+++ b/scripts/install.sh
@@ -63,7 +63,10 @@ status "Installing ollama to $BINDIR..."
 $SUDO install -o0 -g0 -m755 -d $BINDIR
 $SUDO install -o0 -g0 -m755 $TEMP_DIR/ollama $BINDIR/ollama

-install_success() { status 'Install complete. Run "ollama" from the command line.'; }
+install_success() {
+    status 'The Ollama API is now available at 0.0.0.0:11434.'
+    status 'Install complete. Run "ollama" from the command line.'
+}
 trap install_success EXIT

 # Everything from this point onwards is optional.
@@ -74,6 +77,9 @@ configure_systemd() {
        $SUDO useradd -r -s /bin/false -m -d /usr/share/ollama ollama
    fi

+    status "Adding current user to ollama group..."
+    $SUDO usermod -a -G ollama $(whoami)
+
    status "Creating ollama systemd service..."
    cat <<EOF | $SUDO tee /etc/systemd/system/ollama.service >/dev/null
 [Unit]
@@ -86,7 +92,6 @@ User=ollama
 Group=ollama
 Restart=always
 RestartSec=3
-Environment="HOME=/usr/share/ollama"
 Environment="PATH=$PATH"

 [Install]
@@ -128,6 +133,7 @@ if check_gpu nvidia-smi; then
 fi

 if ! check_gpu lspci && ! check_gpu lshw; then
+    install_success
    warning "No NVIDIA GPU detected. Ollama will run in CPU-only mode."
    exit 0
 fi
@@ -174,7 +180,7 @@ install_cuda_driver_apt() {
    case $1 in
        debian)
            status 'Enabling contrib sources...'
-            $SUDO sed 's/main/contrib/' < /etc/apt/sources.list | sudo tee /etc/apt/sources.list.d/contrib.list > /dev/null
+            $SUDO sed 's/main/contrib/' < /etc/apt/sources.list | $SUDO tee /etc/apt/sources.list.d/contrib.list > /dev/null
            ;;
    esac

--- a/server/auth.go
+++ b/server/auth.go
@@ -91,7 +91,7 @@ func getAuthToken(ctx context.Context, redirData AuthRedirect) (string, error) {
 	}

 	s := SignatureData{
-		Method: "GET",
+		Method: http.MethodGet,
 		Path:   redirectURL.String(),
 		Data:   nil,
 	}
@@ -103,7 +103,7 @@ func getAuthToken(ctx context.Context, redirData AuthRedirect) (string, error) {

 	headers := make(http.Header)
 	headers.Set("Authorization", sig)
-	resp, err := makeRequest(ctx, "GET", redirectURL, headers, nil, nil)
+	resp, err := makeRequest(ctx, http.MethodGet, redirectURL, headers, nil, nil)
 	if err != nil {
 		log.Printf("couldn't get token: %q", err)
 		return "", err
--- a/server/download.go
+++ b/server/download.go
@@ -15,6 +15,7 @@ import (
 	"strings"
 	"sync"
 	"sync/atomic"
+	"syscall"
 	"time"

 	"golang.org/x/sync/errgroup"
@@ -88,17 +89,12 @@ func (b *blobDownload) Prepare(ctx context.Context, requestURL *url.URL, opts *R
 	}

 	if len(b.Parts) == 0 {
-		resp, err := makeRequest(ctx, "HEAD", requestURL, nil, nil, opts)
+		resp, err := makeRequestWithRetry(ctx, http.MethodHead, requestURL, nil, nil, opts)
 		if err != nil {
 			return err
 		}
 		defer resp.Body.Close()

-		if resp.StatusCode >= http.StatusBadRequest {
-			body, _ := io.ReadAll(resp.Body)
-			return fmt.Errorf("registry responded with code %d: %v", resp.StatusCode, string(body))
-		}
-
 		b.Total, _ = strconv.ParseInt(resp.Header.Get("Content-Length"), 10, 64)

 		var size = b.Total / numDownloadParts
@@ -133,7 +129,6 @@ func (b *blobDownload) Run(ctx context.Context, requestURL *url.URL, opts *Regis

 func (b *blobDownload) run(ctx context.Context, requestURL *url.URL, opts *RegistryOptions) error {
 	defer blobDownloadManager.Delete(b.Digest)
-
 	ctx, b.CancelFunc = context.WithCancel(ctx)

 	file, err := os.OpenFile(b.Name+"-partial", os.O_CREATE|os.O_RDWR, 0644)
@@ -154,21 +149,26 @@ func (b *blobDownload) run(ctx context.Context, requestURL *url.URL, opts *Regis

 		i := i
 		g.Go(func() error {
+			var err error
 			for try := 0; try < maxRetries; try++ {
 				w := io.NewOffsetWriter(file, part.StartsAt())
-				err := b.downloadChunk(inner, requestURL, w, part, opts)
+				err = b.downloadChunk(inner, requestURL, w, part, opts)
 				switch {
-				case errors.Is(err, context.Canceled):
+				case errors.Is(err, context.Canceled), errors.Is(err, syscall.ENOSPC):
+					// return immediately if the context is canceled or the device is out of space
 					return err
 				case err != nil:
 					log.Printf("%s part %d attempt %d failed: %v, retrying", b.Digest[7:19], i, try, err)
 					continue
 				default:
+					if try > 0 {
+						log.Printf("%s part %d completed after %d retries", b.Digest[7:19], i, try)
+					}
 					return nil
 				}
 			}

-			return errors.New("max retries exceeded")
+			return fmt.Errorf("%w: %w", errMaxRetriesExceeded, err)
 		})
 	}

@@ -198,14 +198,14 @@ func (b *blobDownload) run(ctx context.Context, requestURL *url.URL, opts *Regis
 func (b *blobDownload) downloadChunk(ctx context.Context, requestURL *url.URL, w io.Writer, part *blobDownloadPart, opts *RegistryOptions) error {
 	headers := make(http.Header)
 	headers.Set("Range", fmt.Sprintf("bytes=%d-%d", part.StartsAt(), part.StopsAt()-1))
-	resp, err := makeRequest(ctx, "GET", requestURL, headers, nil, opts)
+	resp, err := makeRequestWithRetry(ctx, http.MethodGet, requestURL, headers, nil, opts)
 	if err != nil {
 		return err
 	}
 	defer resp.Body.Close()

 	n, err := io.Copy(w, io.TeeReader(resp.Body, b))
-	if err != nil && !errors.Is(err, context.Canceled) {
+	if err != nil && !errors.Is(err, context.Canceled) && !errors.Is(err, io.ErrUnexpectedEOF) {
 		// rollback progress
 		b.Completed.Add(-n)
 		return err
@@ -216,7 +216,7 @@ func (b *blobDownload) downloadChunk(ctx context.Context, requestURL *url.URL, w
 		return err
 	}

-	// return nil or context.Canceled
+	// return nil, context.Canceled, or io.ErrUnexpectedEOF (resumable)
 	return err
 }

@@ -306,6 +306,8 @@ type downloadOpts struct {

 const maxRetries = 3

+var errMaxRetriesExceeded = errors.New("max retries exceeded")
+
 // downloadBlob downloads a blob from the registry and stores it in the blobs directory
 func downloadBlob(ctx context.Context, opts downloadOpts) error {
 	fp, err := GetBlobsPath(opts.digest)
--- a/server/images.go
+++ b/server/images.go
@@ -60,18 +60,12 @@ func (m *Model) Prompt(request api.GenerateRequest) (string, error) {
 	}

 	var vars struct {
-		First  bool
 		System string
 		Prompt string
-
-		// deprecated: versions <= 0.0.7 used this to omit the system prompt
-		Context []int
 	}

-	vars.First = len(request.Context) == 0
 	vars.System = m.System
 	vars.Prompt = request.Prompt
-	vars.Context = request.Context

 	if request.System != "" {
 		vars.System = request.System
@@ -131,7 +125,7 @@ func (m *ManifestV2) GetTotalSize() (total int64) {
 }

 func GetManifest(mp ModelPath) (*ManifestV2, string, error) {
-	fp, err := mp.GetManifestPath(false)
+	fp, err := mp.GetManifestPath()
 	if err != nil {
 		return nil, "", err
 	}
@@ -401,7 +395,7 @@ func CreateModel(ctx context.Context, name string, path string, fn func(resp api
 					if err != nil {
 						return err
 					}
-					newLayer.From = mp.GetNamespaceRepository()
+					newLayer.From = mp.GetShortTagname()
 					layers = append(layers, newLayer)
 				}
 			}
@@ -595,10 +589,13 @@ func CreateManifest(name string, cfg *LayerReader, layers []*Layer) error {
 		return err
 	}

-	fp, err := mp.GetManifestPath(true)
+	fp, err := mp.GetManifestPath()
 	if err != nil {
 		return err
 	}
+	if err := os.MkdirAll(filepath.Dir(fp), 0o755); err != nil {
+		return err
+	}
 	return os.WriteFile(fp, manifestJSON, 0o644)
 }

@@ -710,16 +707,19 @@ func CreateLayer(f io.ReadSeeker) (*LayerReader, error) {

 func CopyModel(src, dest string) error {
 	srcModelPath := ParseModelPath(src)
-	srcPath, err := srcModelPath.GetManifestPath(false)
+	srcPath, err := srcModelPath.GetManifestPath()
 	if err != nil {
 		return err
 	}

 	destModelPath := ParseModelPath(dest)
-	destPath, err := destModelPath.GetManifestPath(true)
+	destPath, err := destModelPath.GetManifestPath()
 	if err != nil {
 		return err
 	}
+	if err := os.MkdirAll(filepath.Dir(destPath), 0o755); err != nil {
+		return err
+	}

 	// copy the file
 	input, err := os.ReadFile(srcPath)
@@ -882,7 +882,7 @@ func DeleteModel(name string) error {
 		return err
 	}

-	fp, err := mp.GetManifestPath(false)
+	fp, err := mp.GetManifestPath()
 	if err != nil {
 		return err
 	}
@@ -975,46 +975,7 @@ func PushModel(ctx context.Context, name string, regOpts *RegistryOptions, fn fu
 	layers = append(layers, &manifest.Config)

 	for _, layer := range layers {
-		exists, err := checkBlobExistence(ctx, mp, layer.Digest, regOpts)
-		if err != nil {
-			return err
-		}
-
-		if exists {
-			fn(api.ProgressResponse{
-				Status:    "using existing layer",
-				Digest:    layer.Digest,
-				Total:     layer.Size,
-				Completed: layer.Size,
-			})
-			log.Printf("Layer %s already exists", layer.Digest)
-			continue
-		}
-
-		fn(api.ProgressResponse{
-			Status: "starting upload",
-			Digest: layer.Digest,
-			Total:  layer.Size,
-		})
-
-		location, chunkSize, err := startUpload(ctx, mp, layer, regOpts)
-		if err != nil {
-			log.Printf("couldn't start upload: %v", err)
-			return err
-		}
-
-		if strings.HasPrefix(filepath.Base(location.Path), "sha256:") {
-			layer.Digest = filepath.Base(location.Path)
-			fn(api.ProgressResponse{
-				Status:    "using existing layer",
-				Digest:    layer.Digest,
-				Total:     layer.Size,
-				Completed: layer.Size,
-			})
-			continue
-		}
-
-		if err := uploadBlob(ctx, location, layer, chunkSize, regOpts, fn); err != nil {
+		if err := uploadBlob(ctx, mp, layer, regOpts, fn); err != nil {
 			log.Printf("error uploading blob: %v", err)
 			return err
 		}
@@ -1031,7 +992,7 @@ func PushModel(ctx context.Context, name string, regOpts *RegistryOptions, fn fu

 	headers := make(http.Header)
 	headers.Set("Content-Type", "application/vnd.docker.distribution.manifest.v2+json")
-	resp, err := makeRequestWithRetry(ctx, "PUT", requestURL, headers, bytes.NewReader(manifestJSON), regOpts)
+	resp, err := makeRequestWithRetry(ctx, http.MethodPut, requestURL, headers, bytes.NewReader(manifestJSON), regOpts)
 	if err != nil {
 		return err
 	}
@@ -1121,10 +1082,13 @@ func PullModel(ctx context.Context, name string, regOpts *RegistryOptions, fn fu
 		return err
 	}

-	fp, err := mp.GetManifestPath(true)
+	fp, err := mp.GetManifestPath()
 	if err != nil {
 		return err
 	}
+	if err := os.MkdirAll(filepath.Dir(fp), 0o755); err != nil {
+		return err
+	}

 	err = os.WriteFile(fp, manifestJSON, 0o644)
 	if err != nil {
@@ -1150,22 +1114,12 @@ func pullModelManifest(ctx context.Context, mp ModelPath, regOpts *RegistryOptio

 	headers := make(http.Header)
 	headers.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
-	resp, err := makeRequest(ctx, "GET", requestURL, headers, nil, regOpts)
+	resp, err := makeRequestWithRetry(ctx, http.MethodGet, requestURL, headers, nil, regOpts)
 	if err != nil {
-		log.Printf("couldn't get manifest: %v", err)
 		return nil, err
 	}
 	defer resp.Body.Close()

-	if resp.StatusCode >= http.StatusBadRequest {
-		if resp.StatusCode == http.StatusNotFound {
-			return nil, fmt.Errorf("model not found")
-		}
-
-		body, _ := io.ReadAll(resp.Body)
-		return nil, fmt.Errorf("on pull registry responded with code %d: %s", resp.StatusCode, body)
-	}
-
 	var m *ManifestV2
 	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
 		return nil, err
@@ -1209,24 +1163,7 @@ func GetSHA256Digest(r io.Reader) (string, int64) {
 	return fmt.Sprintf("sha256:%x", h.Sum(nil)), n
 }

-// Function to check if a blob already exists in the Docker registry
-func checkBlobExistence(ctx context.Context, mp ModelPath, digest string, regOpts *RegistryOptions) (bool, error) {
-	requestURL := mp.BaseURL()
-	requestURL = requestURL.JoinPath("v2", mp.GetNamespaceRepository(), "blobs", digest)
-
-	resp, err := makeRequest(ctx, "HEAD", requestURL, nil, nil, regOpts)
-	if err != nil {
-		log.Printf("couldn't check for blob: %v", err)
-		return false, err
-	}
-	defer resp.Body.Close()
-
-	// Check for success: If the blob exists, the Docker registry will respond with a 200 OK
-	return resp.StatusCode < http.StatusBadRequest, nil
-}
-
 func makeRequestWithRetry(ctx context.Context, method string, requestURL *url.URL, headers http.Header, body io.ReadSeeker, regOpts *RegistryOptions) (*http.Response, error) {
-	var status string
 	for try := 0; try < maxRetries; try++ {
 		resp, err := makeRequest(ctx, method, requestURL, headers, body, regOpts)
 		if err != nil {
@@ -1234,8 +1171,6 @@ func makeRequestWithRetry(ctx context.Context, method string, requestURL *url.UR
 			return nil, err
 		}

-		status = resp.Status
-
 		switch {
 		case resp.StatusCode == http.StatusUnauthorized:
 			auth := resp.Header.Get("www-authenticate")
@@ -1247,21 +1182,25 @@ func makeRequestWithRetry(ctx context.Context, method string, requestURL *url.UR

 			regOpts.Token = token
 			if body != nil {
-				if _, err := body.Seek(0, io.SeekStart); err != nil {
-					return nil, err
-				}
+				body.Seek(0, io.SeekStart)
 			}

 			continue
+		case resp.StatusCode == http.StatusNotFound:
+			return nil, os.ErrNotExist
 		case resp.StatusCode >= http.StatusBadRequest:
-			body, _ := io.ReadAll(resp.Body)
-			return nil, fmt.Errorf("on upload registry responded with code %d: %s", resp.StatusCode, body)
+			body, err := io.ReadAll(resp.Body)
+			if err != nil {
+				return nil, fmt.Errorf("%d: %s", resp.StatusCode, err)
+			}
+
+			return nil, fmt.Errorf("%d: %s", resp.StatusCode, body)
 		default:
 			return resp, nil
 		}
 	}

-	return nil, fmt.Errorf("max retry exceeded: %v", status)
+	return nil, errMaxRetriesExceeded
 }

 func makeRequest(ctx context.Context, method string, requestURL *url.URL, headers http.Header, body io.Reader, regOpts *RegistryOptions) (*http.Response, error) {
--- a/server/modelpath.go
+++ b/server/modelpath.go
@@ -85,20 +85,27 @@ func (mp ModelPath) GetShortTagname() string {
 	return fmt.Sprintf("%s/%s/%s:%s", mp.Registry, mp.Namespace, mp.Repository, mp.Tag)
 }

-func (mp ModelPath) GetManifestPath(createDir bool) (string, error) {
+// modelsDir returns the value of the OLLAMA_MODELS environment variable or the user's home directory if OLLAMA_MODELS is not set.
+// The models directory is where Ollama stores its model files and manifests.
+func modelsDir() (string, error) {
+	if models, exists := os.LookupEnv("OLLAMA_MODELS"); exists {
+		return models, nil
+	}
 	home, err := os.UserHomeDir()
 	if err != nil {
 		return "", err
 	}
+	return filepath.Join(home, ".ollama", "models"), nil
+}

-	path := filepath.Join(home, ".ollama", "models", "manifests", mp.Registry, mp.Namespace, mp.Repository, mp.Tag)
-	if createDir {
-		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
-			return "", err
-		}
+// GetManifestPath returns the path to the manifest file for the given model path, it is up to the caller to create the directory if it does not exist.
+func (mp ModelPath) GetManifestPath() (string, error) {
+	dir, err := modelsDir()
+	if err != nil {
+		return "", err
 	}

-	return path, nil
+	return filepath.Join(dir, "manifests", mp.Registry, mp.Namespace, mp.Repository, mp.Tag), nil
 }

 func (mp ModelPath) BaseURL() *url.URL {
@@ -109,12 +116,12 @@ func (mp ModelPath) BaseURL() *url.URL {
 }

 func GetManifestPath() (string, error) {
-	home, err := os.UserHomeDir()
+	dir, err := modelsDir()
 	if err != nil {
 		return "", err
 	}

-	path := filepath.Join(home, ".ollama", "models", "manifests")
+	path := filepath.Join(dir, "manifests")
 	if err := os.MkdirAll(path, 0o755); err != nil {
 		return "", err
 	}
@@ -123,7 +130,7 @@ func GetManifestPath() (string, error) {
 }

 func GetBlobsPath(digest string) (string, error) {
-	home, err := os.UserHomeDir()
+	dir, err := modelsDir()
 	if err != nil {
 		return "", err
 	}
@@ -132,7 +139,7 @@ func GetBlobsPath(digest string) (string, error) {
 		digest = strings.ReplaceAll(digest, ":", "-")
 	}

-	path := filepath.Join(home, ".ollama", "models", "blobs", digest)
+	path := filepath.Join(dir, "blobs", digest)
 	dirPath := filepath.Dir(path)
 	if digest == "" {
 		dirPath = path
--- a/server/routes.go
+++ b/server/routes.go
@@ -158,9 +158,17 @@ func GenerateHandler(c *gin.Context) {
 		return
 	}

-	if req.Model == "" {
+	// validate the request
+	switch {
+	case req.Model == "":
 		c.AbortWithStatusJSON(http.StatusBadRequest, gin.H{"error": "model is required"})
 		return
+	case len(req.Format) > 0 && req.Format != "json":
+		c.AbortWithStatusJSON(http.StatusBadRequest, gin.H{"error": "format must be json"})
+		return
+	case req.Raw && (req.Template != "" || req.System != "" || len(req.Context) > 0):
+		c.AbortWithStatusJSON(http.StatusBadRequest, gin.H{"error": "raw mode does not support template, system, or context"})
+		return
 	}

 	model, err := GetModel(req.Model)
@@ -189,10 +197,13 @@ func GenerateHandler(c *gin.Context) {

 	checkpointLoaded := time.Now()

-	prompt, err := model.Prompt(req)
-	if err != nil {
-		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
-		return
+	prompt := req.Prompt
+	if !req.Raw {
+		prompt, err = model.Prompt(req)
+		if err != nil {
+			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
+			return
+		}
 	}

 	ch := make(chan any)
@@ -215,10 +226,15 @@ func GenerateHandler(c *gin.Context) {
 				r.LoadDuration = checkpointLoaded.Sub(checkpointStart)
 			}

+			if req.Raw {
+				// in raw mode the client must manage history on their own
+				r.Context = nil
+			}
+
 			ch <- r
 		}

-		if err := loaded.runner.Predict(c.Request.Context(), req.Context, prompt, fn); err != nil {
+		if err := loaded.runner.Predict(c.Request.Context(), req.Context, prompt, req.Format, fn); err != nil {
 			ch <- gin.H{"error": err.Error()}
 		}
 	}()
@@ -365,7 +381,9 @@ func PushModelHandler(c *gin.Context) {
 			Insecure: req.Insecure,
 		}

-		ctx := context.Background()
+		ctx, cancel := context.WithCancel(c.Request.Context())
+		defer cancel()
+
 		if err := PushModel(ctx, req.Name, regOpts, fn); err != nil {
 			ch <- gin.H{"error": err.Error()}
 		}
@@ -614,6 +632,22 @@ var defaultAllowOrigins = []string{
 }

 func Serve(ln net.Listener, allowOrigins []string) error {
+	if noprune := os.Getenv("OLLAMA_NOPRUNE"); noprune == "" {
+		// clean up unused layers and manifests
+		if err := PruneLayers(); err != nil {
+			return err
+		}
+
+		manifestsPath, err := GetManifestPath()
+		if err != nil {
+			return err
+		}
+
+		if err := PruneDirectory(manifestsPath); err != nil {
+			return err
+		}
+	}
+
 	config := cors.DefaultConfig()
 	config.AllowWildcard = true

@@ -679,7 +713,7 @@ func Serve(ln net.Listener, allowOrigins []string) error {
 	if runtime.GOOS == "linux" {
 		// check compatibility to log warnings
 		if _, err := llm.CheckVRAM(); err != nil {
-			log.Printf("Warning: GPU support may not enabled, check you have installed install GPU drivers: %v", err)
+			log.Printf("Warning: GPU support may not be enabled, check you have installed GPU drivers: %v", err)
 		}
 	}

--- a/server/upload.go
+++ b/server/upload.go
@@ -2,218 +2,369 @@ package server

 import (
 	"context"
+	"crypto/md5"
 	"errors"
 	"fmt"
+	"hash"
 	"io"
 	"log"
 	"net/http"
 	"net/url"
 	"os"
-	"strconv"
+	"strings"
 	"sync"
+	"sync/atomic"
+	"time"

 	"github.com/jmorganca/ollama/api"
+	"github.com/jmorganca/ollama/format"
+	"golang.org/x/sync/errgroup"
 )

+var blobUploadManager sync.Map
+
+type blobUpload struct {
+	*Layer
+
+	Total     int64
+	Completed atomic.Int64
+
+	Parts []blobUploadPart
+
+	nextURL chan *url.URL
+
+	context.CancelFunc
+
+	done       bool
+	err        error
+	references atomic.Int32
+}
+
 const (
-	redirectChunkSize int64 = 1024 * 1024 * 1024
-	regularChunkSize  int64 = 95 * 1024 * 1024
+	numUploadParts          = 64
+	minUploadPartSize int64 = 95 * 1000 * 1000
+	maxUploadPartSize int64 = 1000 * 1000 * 1000
 )

-func startUpload(ctx context.Context, mp ModelPath, layer *Layer, regOpts *RegistryOptions) (*url.URL, int64, error) {
-	requestURL := mp.BaseURL()
-	requestURL = requestURL.JoinPath("v2", mp.GetNamespaceRepository(), "blobs/uploads/")
-	if layer.From != "" {
+func (b *blobUpload) Prepare(ctx context.Context, requestURL *url.URL, opts *RegistryOptions) error {
+	p, err := GetBlobsPath(b.Digest)
+	if err != nil {
+		return err
+	}
+
+	if b.From != "" {
 		values := requestURL.Query()
-		values.Add("mount", layer.Digest)
-		values.Add("from", layer.From)
+		values.Add("mount", b.Digest)
+		values.Add("from", b.From)
 		requestURL.RawQuery = values.Encode()
 	}

-	resp, err := makeRequestWithRetry(ctx, "POST", requestURL, nil, nil, regOpts)
+	resp, err := makeRequestWithRetry(ctx, http.MethodPost, requestURL, nil, nil, opts)
 	if err != nil {
-		log.Printf("couldn't start upload: %v", err)
-		return nil, 0, err
+		return err
 	}
 	defer resp.Body.Close()

 	location := resp.Header.Get("Docker-Upload-Location")
-	chunkSize := redirectChunkSize
 	if location == "" {
 		location = resp.Header.Get("Location")
-		chunkSize = regularChunkSize
 	}

-	locationURL, err := url.Parse(location)
+	fi, err := os.Stat(p)
 	if err != nil {
-		return nil, 0, err
+		return err
 	}

-	return locationURL, chunkSize, nil
+	b.Total = fi.Size()
+
+	var size = b.Total / numUploadParts
+	switch {
+	case size < minUploadPartSize:
+		size = minUploadPartSize
+	case size > maxUploadPartSize:
+		size = maxUploadPartSize
+	}
+
+	var offset int64
+	for offset < fi.Size() {
+		if offset+size > fi.Size() {
+			size = fi.Size() - offset
+		}
+
+		// set part.N to the current number of parts
+		b.Parts = append(b.Parts, blobUploadPart{blobUpload: b, N: len(b.Parts), Offset: offset, Size: size})
+		offset += size
+	}
+
+	log.Printf("uploading %s in %d %s part(s)", b.Digest[7:19], len(b.Parts), format.HumanBytes(b.Parts[0].Size))
+
+	requestURL, err = url.Parse(location)
+	if err != nil {
+		return err
+	}
+
+	b.nextURL = make(chan *url.URL, 1)
+	b.nextURL <- requestURL
+	return nil
 }

-func uploadBlob(ctx context.Context, requestURL *url.URL, layer *Layer, chunkSize int64, regOpts *RegistryOptions, fn func(api.ProgressResponse)) error {
-	// TODO allow resumability
-	// TODO allow canceling uploads via DELETE
+// Run uploads blob parts to the upstream. If the upstream supports redirection, parts will be uploaded
+// in parallel as defined by Prepare. Otherwise, parts will be uploaded serially. Run sets b.err on error.
+func (b *blobUpload) Run(ctx context.Context, opts *RegistryOptions) {
+	defer blobUploadManager.Delete(b.Digest)
+	ctx, b.CancelFunc = context.WithCancel(ctx)

-	fp, err := GetBlobsPath(layer.Digest)
+	p, err := GetBlobsPath(b.Digest)
 	if err != nil {
-		return err
+		b.err = err
+		return
 	}

-	f, err := os.Open(fp)
+	f, err := os.Open(p)
 	if err != nil {
-		return err
+		b.err = err
+		return
 	}
 	defer f.Close()

-	pw := ProgressWriter{
-		status: fmt.Sprintf("uploading %s", layer.Digest),
-		digest: layer.Digest,
-		total:  layer.Size,
-		fn:     fn,
-	}
+	g, inner := errgroup.WithContext(ctx)
+	g.SetLimit(numUploadParts)
+	for i := range b.Parts {
+		part := &b.Parts[i]
+		select {
+		case <-inner.Done():
+		case requestURL := <-b.nextURL:
+			g.Go(func() error {
+				var err error
+				for try := 0; try < maxRetries; try++ {
+					part.ReadSeeker = io.NewSectionReader(f, part.Offset, part.Size)
+					err = b.uploadChunk(inner, http.MethodPatch, requestURL, part, opts)
+					switch {
+					case errors.Is(err, context.Canceled):
+						return err
+					case errors.Is(err, errMaxRetriesExceeded):
+						return err
+					case err != nil:
+						log.Printf("%s part %d attempt %d failed: %v, retrying", b.Digest[7:19], part.N, try, err)
+						continue
+					}

-	for offset := int64(0); offset < layer.Size; {
-		chunk := layer.Size - offset
-		if chunk > chunkSize {
-			chunk = chunkSize
-		}
+					return nil
+				}

-		resp, err := uploadBlobChunk(ctx, http.MethodPatch, requestURL, f, offset, chunk, regOpts, &pw)
-		if err != nil {
-			fn(api.ProgressResponse{
-				Status:    fmt.Sprintf("error uploading chunk: %v", err),
-				Digest:    layer.Digest,
-				Total:     layer.Size,
-				Completed: offset,
+				return fmt.Errorf("%w: %w", errMaxRetriesExceeded, err)
 			})
-
-			return err
-		}
-
-		offset += chunk
-		location := resp.Header.Get("Docker-Upload-Location")
-		if location == "" {
-			location = resp.Header.Get("Location")
-		}
-
-		requestURL, err = url.Parse(location)
-		if err != nil {
-			return err
 		}
 	}

+	if err := g.Wait(); err != nil {
+		b.err = err
+		return
+	}
+
+	requestURL := <-b.nextURL
+
+	var sb strings.Builder
+	for _, part := range b.Parts {
+		sb.Write(part.Sum(nil))
+	}
+
+	md5sum := md5.Sum([]byte(sb.String()))
+
 	values := requestURL.Query()
-	values.Add("digest", layer.Digest)
+	values.Add("digest", b.Digest)
+	values.Add("etag", fmt.Sprintf("%x-%d", md5sum, len(b.Parts)))
 	requestURL.RawQuery = values.Encode()

 	headers := make(http.Header)
 	headers.Set("Content-Type", "application/octet-stream")
 	headers.Set("Content-Length", "0")

-	// finish the upload
-	resp, err := makeRequest(ctx, "PUT", requestURL, headers, nil, regOpts)
+	resp, err := makeRequestWithRetry(ctx, http.MethodPut, requestURL, headers, nil, opts)
+	if err != nil {
+		b.err = err
+		return
+	}
+	defer resp.Body.Close()
+
+	b.done = true
+}
+
+func (b *blobUpload) uploadChunk(ctx context.Context, method string, requestURL *url.URL, part *blobUploadPart, opts *RegistryOptions) error {
+	part.Reset()
+
+	headers := make(http.Header)
+	headers.Set("Content-Type", "application/octet-stream")
+	headers.Set("Content-Length", fmt.Sprintf("%d", part.Size))
+	headers.Set("X-Redirect-Uploads", "1")
+
+	if method == http.MethodPatch {
+		headers.Set("Content-Range", fmt.Sprintf("%d-%d", part.Offset, part.Offset+part.Size-1))
+	}
+
+	resp, err := makeRequest(ctx, method, requestURL, headers, io.TeeReader(part.ReadSeeker, io.MultiWriter(part, part.Hash)), opts)
 	if err != nil {
-		log.Printf("couldn't finish upload: %v", err)
 		return err
 	}
 	defer resp.Body.Close()

-	if resp.StatusCode >= http.StatusBadRequest {
-		body, _ := io.ReadAll(resp.Body)
-		return fmt.Errorf("on finish upload registry responded with code %d: %v", resp.StatusCode, string(body))
-	}
-	return nil
-}
-
-func uploadBlobChunk(ctx context.Context, method string, requestURL *url.URL, r io.ReaderAt, offset, limit int64, opts *RegistryOptions, pw *ProgressWriter) (*http.Response, error) {
-	sectionReader := io.NewSectionReader(r, offset, limit)
-
-	headers := make(http.Header)
-	headers.Set("Content-Type", "application/octet-stream")
-	headers.Set("Content-Length", strconv.Itoa(int(limit)))
-	headers.Set("X-Redirect-Uploads", "1")
-
-	if method == http.MethodPatch {
-		headers.Set("Content-Range", fmt.Sprintf("%d-%d", offset, offset+sectionReader.Size()-1))
+	location := resp.Header.Get("Docker-Upload-Location")
+	if location == "" {
+		location = resp.Header.Get("Location")
 	}

-	for try := 0; try < maxRetries; try++ {
-		resp, err := makeRequest(ctx, method, requestURL, headers, io.TeeReader(sectionReader, pw), opts)
-		if err != nil && !errors.Is(err, io.EOF) {
-			return nil, err
+	nextURL, err := url.Parse(location)
+	if err != nil {
+		return err
+	}
+
+	switch {
+	case resp.StatusCode == http.StatusTemporaryRedirect:
+		b.nextURL <- nextURL
+
+		redirectURL, err := resp.Location()
+		if err != nil {
+			return err
 		}
-		defer resp.Body.Close()

-		switch {
-		case resp.StatusCode == http.StatusTemporaryRedirect:
-			location, err := resp.Location()
-			if err != nil {
-				return nil, err
-			}
-
-			pw.completed = offset
-			if _, err := uploadBlobChunk(ctx, http.MethodPut, location, r, offset, limit, nil, pw); err != nil {
-				// retry
-				log.Printf("retrying redirected upload: %v", err)
+		for try := 0; try < maxRetries; try++ {
+			err = b.uploadChunk(ctx, http.MethodPut, redirectURL, part, nil)
+			switch {
+			case errors.Is(err, context.Canceled):
+				return err
+			case errors.Is(err, errMaxRetriesExceeded):
+				return err
+			case err != nil:
+				log.Printf("%s part %d attempt %d failed: %v, retrying", b.Digest[7:19], part.N, try, err)
 				continue
 			}

-			return resp, nil
-		case resp.StatusCode == http.StatusUnauthorized:
-			auth := resp.Header.Get("www-authenticate")
-			authRedir := ParseAuthRedirectString(auth)
-			token, err := getAuthToken(ctx, authRedir)
-			if err != nil {
-				return nil, err
-			}
-
-			opts.Token = token
-
-			pw.completed = offset
-			sectionReader = io.NewSectionReader(r, offset, limit)
-			continue
-		case resp.StatusCode >= http.StatusBadRequest:
-			body, _ := io.ReadAll(resp.Body)
-			return nil, fmt.Errorf("on upload registry responded with code %d: %s", resp.StatusCode, body)
+			return nil
 		}

-		return resp, nil
+		return fmt.Errorf("%w: %w", errMaxRetriesExceeded, err)
+
+	case resp.StatusCode == http.StatusUnauthorized:
+		auth := resp.Header.Get("www-authenticate")
+		authRedir := ParseAuthRedirectString(auth)
+		token, err := getAuthToken(ctx, authRedir)
+		if err != nil {
+			return err
+		}
+
+		opts.Token = token
+		fallthrough
+	case resp.StatusCode >= http.StatusBadRequest:
+		body, err := io.ReadAll(resp.Body)
+		if err != nil {
+			return err
+		}
+
+		return fmt.Errorf("http status %d %s: %s", resp.StatusCode, resp.Status, body)
 	}

-	return nil, fmt.Errorf("max retries exceeded")
+	if method == http.MethodPatch {
+		b.nextURL <- nextURL
+	}
+
+	return nil
 }

-type ProgressWriter struct {
-	status    string
-	digest    string
-	bucket    int64
-	completed int64
-	total     int64
-	fn        func(api.ProgressResponse)
-	mu        sync.Mutex
+func (b *blobUpload) acquire() {
+	b.references.Add(1)
 }

-func (pw *ProgressWriter) Write(b []byte) (int, error) {
-	pw.mu.Lock()
-	defer pw.mu.Unlock()
+func (b *blobUpload) release() {
+	if b.references.Add(-1) == 0 {
+		b.CancelFunc()
+	}
+}

-	n := len(b)
-	pw.bucket += int64(n)
+func (b *blobUpload) Wait(ctx context.Context, fn func(api.ProgressResponse)) error {
+	b.acquire()
+	defer b.release()

-	// throttle status updates to not spam the client
-	if pw.bucket >= 1024*1024 || pw.completed+pw.bucket >= pw.total {
-		pw.completed += pw.bucket
-		pw.fn(api.ProgressResponse{
-			Status:    pw.status,
-			Digest:    pw.digest,
-			Total:     pw.total,
-			Completed: pw.completed,
+	ticker := time.NewTicker(60 * time.Millisecond)
+	for {
+		select {
+		case <-ticker.C:
+		case <-ctx.Done():
+			return ctx.Err()
+		}
+
+		fn(api.ProgressResponse{
+			Status:    fmt.Sprintf("uploading %s", b.Digest),
+			Digest:    b.Digest,
+			Total:     b.Total,
+			Completed: b.Completed.Load(),
 		})

-		pw.bucket = 0
+		if b.done || b.err != nil {
+			return b.err
+		}
 	}
+}

+type blobUploadPart struct {
+	// N is the part number
+	N      int
+	Offset int64
+	Size   int64
+	hash.Hash
+
+	written int64
+
+	io.ReadSeeker
+	*blobUpload
+}
+
+func (p *blobUploadPart) Write(b []byte) (n int, err error) {
+	n = len(b)
+	p.written += int64(n)
+	p.Completed.Add(int64(n))
 	return n, nil
 }
+
+func (p *blobUploadPart) Reset() {
+	p.Seek(0, io.SeekStart)
+	p.Completed.Add(-int64(p.written))
+	p.written = 0
+	p.Hash = md5.New()
+}
+
+func uploadBlob(ctx context.Context, mp ModelPath, layer *Layer, opts *RegistryOptions, fn func(api.ProgressResponse)) error {
+	requestURL := mp.BaseURL()
+	requestURL = requestURL.JoinPath("v2", mp.GetNamespaceRepository(), "blobs", layer.Digest)
+
+	resp, err := makeRequestWithRetry(ctx, http.MethodHead, requestURL, nil, nil, opts)
+	switch {
+	case errors.Is(err, os.ErrNotExist):
+	case err != nil:
+		return err
+	default:
+		defer resp.Body.Close()
+		fn(api.ProgressResponse{
+			Status:    fmt.Sprintf("uploading %s", layer.Digest),
+			Digest:    layer.Digest,
+			Total:     layer.Size,
+			Completed: layer.Size,
+		})
+
+		return nil
+	}
+
+	data, ok := blobUploadManager.LoadOrStore(layer.Digest, &blobUpload{Layer: layer})
+	upload := data.(*blobUpload)
+	if !ok {
+		requestURL := mp.BaseURL()
+		requestURL = requestURL.JoinPath("v2", mp.GetNamespaceRepository(), "blobs/uploads/")
+		if err := upload.Prepare(ctx, requestURL, opts); err != nil {
+			blobUploadManager.Delete(layer.Digest)
+			return err
+		}
+
+		go upload.Run(context.Background(), opts)
+	}
+
+	return upload.Wait(ctx, fn)
+}
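Note on `blobUpload.Prepare` above: the part sizing can be checked with a small standalone program. The constants are copied from the diff; the `part` type and `planParts` function are hypothetical names for this sketch, which leaves out the HTTP and hashing work.

```go
package main

import "fmt"

const (
	numUploadParts          = 64
	minUploadPartSize int64 = 95 * 1000 * 1000
	maxUploadPartSize int64 = 1000 * 1000 * 1000
)

// part describes one contiguous byte range of the blob.
type part struct {
	N      int
	Offset int64
	Size   int64
}

// planParts aims for numUploadParts parts, clamps the part size to the
// [min, max] range, and shrinks the final part to whatever remains.
func planParts(total int64) []part {
	size := total / numUploadParts
	switch {
	case size < minUploadPartSize:
		size = minUploadPartSize
	case size > maxUploadPartSize:
		size = maxUploadPartSize
	}

	var parts []part
	for offset := int64(0); offset < total; {
		if offset+size > total {
			size = total - offset
		}
		parts = append(parts, part{N: len(parts), Offset: offset, Size: size})
		offset += size
	}
	return parts
}

func main() {
	for _, total := range []int64{50_000_000, 4_000_000_000, 100_000_000_000} {
		parts := planParts(total)
		fmt.Printf("%d bytes -> %d part(s), first part %d bytes\n", total, len(parts), parts[0].Size)
	}
}
```

Because of the clamping, the part count is not always 64: a blob smaller than the minimum part size becomes a single part, and a very large blob is split into more than 64 parts of at most 1 GB each.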
Author	SHA1	Message	Date
Jeffrey Morgan	1d78d96fc6	remove `.First`	2023-11-15 18:07:13 -05:00
Michael Yang	686f85d6ca	Merge pull request #1132 from jmorganca/mxyng/human-bytes replace go-humanize with format.HumanBytes	2023-11-15 09:46:21 -08:00
bnodnarb	85951d25ef	Created tutorial for running Ollama on NVIDIA Jetson devices (#1098)	2023-11-15 12:32:37 -05:00
Michael Yang	01ea6002c4	replace go-humanize with format.HumanBytes	2023-11-14 14:57:41 -08:00
Jeffrey Morgan	423862042a	treat `ollama run model < file` as entire prompt, not prompt-per-line (#1126) Previously, `ollama run` treated a non-terminal stdin (such as `ollama run model < file`) as containing one prompt per line. To run inference on a multi-line prompt, the only non-API workaround was to run `ollama run` interactively and wrap the prompt in `"""..."""`. Now, `ollama run` treats a non-terminal stdin as containing a single prompt. For example, if `myprompt.txt` is a multi-line file, then `ollama run model < myprompt.txt` would treat `myprompt.txt`'s entire contents as the prompt. Co-authored-by: Quinn Slack <quinn@slack.org>	2023-11-14 16:42:21 -05:00
Bruce MacDonald	df18486c35	Move /generate format to optional parameters (#1127) This field is optional and should be under the `Advanced parameters` header	2023-11-14 16:12:30 -05:00
Jeffrey Morgan	4e612a2e92	use stdout fd for terminal size (#1125)	2023-11-14 16:09:09 -05:00
Jeffrey Morgan	6e0f686afa	`--format json` should work in interactive mode	2023-11-14 10:22:03 -05:00
Jeffrey Morgan	c1844bbee2	add json mode to cli (#1095)	2023-11-13 21:54:02 -05:00
Huy Le	cb745965ce	adding ollama.nvim for visibility (#1115)	2023-11-13 17:00:17 -05:00
Enrico Ros	8d29b6a2b6	New big-AGI integration (#1078) * New big-AGI integration Ollama works great in big-AGI, and this document explains how to link the two projects. * Update README.md	2023-11-13 16:59:00 -05:00
Ilya Breitburg	724aa64bee	Add Dart library to README.md (#1106)	2023-11-13 14:50:42 -05:00
Michael Yang	d91c103e74	Merge pull request #1055 from dansreis/946-fix-incorrect-base-model-name Fixed incorrect base model name	2023-11-13 08:42:55 -08:00
Kevin Hermawan	98ec7d81e3	Add OllamaKit to the community integrations (#1085)	2023-11-11 14:41:42 -08:00
Daniel Reis	7c438f2c53	Replaced method	2023-11-10 20:22:03 +00:00
Daniel Reis	6e46338d44	Reverting previous changes	2023-11-10 20:21:35 +00:00
Jeffrey Morgan	cdddd3df65	add `format` to example python client	2023-11-10 10:22:21 -08:00
Daniel Hiltgen	afa61bdf45	Merge pull request #1075 from jmorganca/dhiltgen/unexpected-eof Resume chunk download on UnexpectedEOF errors	2023-11-10 08:48:27 -08:00
Daniel Hiltgen	cc54a416c6	Resume chunk download on UnexpectedEOF errors If the chunk download is interrupted, resume from where we left off	2023-11-10 08:29:42 -08:00
Matt Williams	c819d7f68a	Merge pull request #955 from jmorganca/mattw/example-bash-compare docs: add examples using bash to compare models	2023-11-10 08:59:32 -06:00
Jeffrey Morgan	5cba29b9d6	JSON mode: add `"format" as an api parameter (#1051) * add `"format": "json"` as an API parameter --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-11-09 16:44:02 -08:00
Daniel Reis	d17730356a	Removed inline parse model path	2023-11-09 22:44:26 +00:00
Daniel Reis	32d79a6eea	Using 'GetShortTagname' method instead	2023-11-09 22:40:37 +00:00
Bruce MacDonald	5b39503bcd	document specifying multiple stop params (#1061)	2023-11-09 13:16:26 -08:00
Bruce MacDonald	1ae84bc2a2	skip gpu if less than 2GB VRAM are available (#1059)	2023-11-09 13:16:16 -08:00
Bruce MacDonald	db8bf336fc	Update README.md	2023-11-09 12:53:24 -08:00
Nick Anderson	d77e094a90	Added gptel to list of integrations (#1062)	2023-11-09 12:52:36 -08:00
Matt Williams	dd3dc47ddb	Merge pull request #992 from aashish2057/aashish2057/langchainjs_doc_update	2023-11-09 05:08:31 -08:00
Michael Yang	c5e1bbabda	instead of static number of parameters for each model family, get the real number from the tensors (#1022) * parse tensor info * refactor decoder * return actual parameter count * explicit rounding * s/Human/HumanNumber/	2023-11-08 17:55:46 -08:00
Bruce MacDonald	a49d6acc1e	add a complete /generate options example (#1035)	2023-11-08 16:44:36 -08:00
Moritz Poldrack	6e9bcdb9b3	progressbar: make start and end seamless (#1042)	2023-11-08 16:42:40 -08:00
Matt Williams	13086363bd	Update as per bmacd Signed-off-by: Matt Williams <m@technovangelist.com>	2023-11-08 18:09:05 -06:00
Bruce MacDonald	ec2a31e9b3	support raw generation requests (#952) - add the optional `raw` generate request parameter to bypass prompt formatting and response context - add raw request to docs	2023-11-08 14:05:02 -08:00
Amith Koujalgi	ec84c02d54	Add Ollama4j Java library to the list of community libraries (#1044)	2023-11-08 11:04:32 -08:00
Kevin Hermawan	2a88b66bc9	Add Ollamac to community integrations (#1043)	2023-11-08 11:01:09 -08:00
Jeffrey Morgan	2d0faea96c	clean up `README.md`	2023-11-08 00:03:29 -08:00
Jeffrey Morgan	637142181a	clean up `README.md`	2023-11-07 23:52:31 -08:00
Matt Williams	bcbff421c9	Merge pull request #1023 from jmorganca/mattw/wherearemodelsfaq	2023-11-07 17:59:54 -08:00
thealhu	1359d6cf3b	Fix sudo variable in install.sh (#1034) It was forgotten to replace sudo at one place with the variable for sudo.	2023-11-07 09:59:57 -08:00
Omar Magdy	6e2d0224d9	Added logseq ollama plugin (#1029)	2023-11-07 09:58:13 -08:00
Ikko Eltociear Ashimine	921406f721	Update client.py (#1026) recieve -> receive	2023-11-07 09:55:47 -08:00
Michael Yang	c7047d7353	Merge pull request #959 from jmorganca/mxyng/example-k8s	2023-11-07 10:43:21 -06:00
Matt Williams	1d155caba3	docs: clarify where the models are stored in the faq Signed-off-by: Matt Williams <m@technovangelist.com>	2023-11-06 14:38:49 -08:00
Michael Yang	866324b9a5	Merge pull request #943 from tjbck/patch-1 doc: categorised community integrations + added ollama-webui	2023-11-06 11:35:39 -08:00
Michael Yang	145e060855	Apply suggestions from code review Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-11-06 11:32:23 -08:00
Michael Yang	146072113d	Merge pull request #993 from jmorganca/mxyng/cleanup cleanup upload and download errors	2023-11-06 11:32:12 -08:00
Timothy Jaeryang Baek	33d31d1b56	Merge branch 'main' into patch-1	2023-11-06 14:27:02 -05:00
Dr. David A. Kunz	274c6cbf4c	Added gen.nvim to community integrations (#996)	2023-11-06 10:51:41 -08:00
Elton Renda	7ebbd89bbf	add hass-ollama-conversation (#999)	2023-11-06 10:50:35 -08:00
Lars Grammel	9079b1bb6d	Add ModelFusion community integration (#1020)	2023-11-06 10:46:16 -08:00
Timothy Jaeryang Baek	6febde7200	Merge branch 'main' into patch-1	2023-11-04 19:12:18 -05:00
pepperoni21	325cfcd9ff	Added ollama-rs to community integrations (#995) Co-authored-by: pepperoni21 <pepperoni2100@gmail.com>	2023-11-04 14:51:29 -07:00
Jeffrey Morgan	639d0fd070	Update README.md	2023-11-04 12:24:24 -07:00
Jeffrey Morgan	e21579a0f1	Restore system prompt on requests	2023-11-03 17:26:45 -07:00
Jeffrey Morgan	c44b619428	remove unused `fmt.Println`	2023-11-03 17:24:58 -07:00
Michael Yang	434a6f9d46	return last error	2023-11-03 16:49:51 -07:00
aashish2057	b13586cc72	update langchainjs doc	2023-11-03 18:45:19 -05:00
Jeffrey Morgan	17678b7225	Restore system prompt on requests and default `num_keep` to `0`	2023-11-03 13:25:25 -07:00
Michael Yang	84725ec7e3	refactor part reset	2023-11-03 09:20:32 -07:00
Bruce MacDonald	6109bebba6	reformat api docs for more examples (#972)	2023-11-03 10:57:00 -04:00
Noah Gitsham	8ae8c9fa8c	Remove duplicate "install" in GPU support warning (#984)	2023-11-03 00:45:14 -07:00
Noah Gitsham	f39daff461	Add missing "be" to GPU support warning message (#983)	2023-11-02 18:37:12 -07:00
Jeffrey Morgan	c50b01bc21	check `request.Context` for initial system prompt	2023-11-02 18:17:00 -07:00
Bruce MacDonald	b9dc875401	remove modelfile context deprecated in v0.0.7 (#974)	2023-11-02 20:52:56 -04:00
Jeffrey Morgan	06589a3b30	Set `NumKeep` to `4` by default (#982)	2023-11-02 17:26:11 -07:00
Michael Yang	1fd511e661	Merge pull request #975 from jmorganca/mxyng/downloads update downloads to use retry wrapper	2023-11-02 16:12:48 -07:00
Michael Yang	c01bbe94fd	Merge pull request #979 from jmorganca/mxyng/num-keep update default NumKeep	2023-11-02 15:48:44 -07:00
Jeffrey Morgan	1beb5645a9	only use system prompt if context is not provided (#978)	2023-11-02 15:48:02 -07:00
Michael Yang	6db3691b8f	update default NumKeep	2023-11-02 15:47:35 -07:00
Michael Yang	fe5a872444	fix upload	2023-11-02 13:25:58 -07:00
Michael Yang	d39709260f	download with retry	2023-11-02 13:16:11 -07:00
Michael Yang	60bb3c03a1	use http.Method	2023-11-02 13:12:45 -07:00
Jeffrey Morgan	2e53704685	default rope params to 0 for new models (#968)	2023-11-02 08:41:30 -07:00
Michael Yang	527f9a7975	Merge pull request #966 from jmorganca/mxyng/fix-log	2023-11-01 17:49:10 -07:00
Michael Yang	c4cc738cbf	fix log	2023-11-01 17:18:11 -07:00
Michael Yang	2c6189f4fe	Merge pull request #750 from jmorganca/mxyng/concurrent-uploads concurrent uploads	2023-11-01 15:00:01 -07:00
Michael Yang	dccac8c8fa	k8s example	2023-11-01 14:52:58 -07:00
Michael Yang	c05ab9a86e	Merge pull request #965 from jmorganca/mxyng/go-mod-tidy go mod tidy	2023-11-01 11:55:43 -07:00
Michael Yang	f42f3d9b27	go fmt	2023-11-01 11:55:08 -07:00
Michael Yang	341fb7e35f	go mod tidy	2023-11-01 11:54:25 -07:00
Michael	f31961637f	Update README.md	2023-11-01 12:20:55 -04:00
Michael Yang	ec3614812a	Merge pull request #960 from jmorganca/mxyng/fix-tautology	2023-11-01 08:30:49 -07:00
Michael Yang	f14969314a	Merge pull request #958 from jmorganca/mxyng/append-ld-library-path	2023-11-01 08:30:38 -07:00
Bruce MacDonald	1fb9288661	notify that the ollama api is available after linux install (#954)	2023-11-01 11:28:26 -04:00
Matt Williams	01a03caa20	Merge pull request #956 from jmorganca/mattw/apidocupdate	2023-10-31 21:43:11 -07:00
Michael Yang	bf6786bb39	fix tautology	2023-10-31 20:49:48 -07:00
Michael Yang	642128b75a	append LD_LIBRARY_PATH	2023-10-31 15:54:49 -07:00
Matt Williams	f21bd6210d	docs: clarify and clean up API docs Signed-off-by: Matt Williams <m@technovangelist.com>	2023-10-31 13:11:33 -07:00
Matt Williams	80362fedce	better readme Signed-off-by: Matt Williams <m@technovangelist.com>	2023-10-31 12:40:46 -07:00
Matt Williams	5757925060	add a gif Signed-off-by: Matt Williams <m@technovangelist.com>	2023-10-31 11:52:01 -07:00
Michael	4512301756	Update README.md	2023-10-31 13:25:36 -04:00
Matt Williams	2236a93efc	docs: add examples using bash to compare models Signed-off-by: Matt Williams <m@technovangelist.com>	2023-10-31 09:12:39 -07:00
Matt Williams	ad88799411	Merge pull request #949 from jmorganca/matt/fixPrivateGPT fix: private gpt example was broken due to changes in chroma	2023-10-30 17:17:00 -07:00
Bruce MacDonald	0818b5e318	readline windows terminal support (#950) - update the readline package to have basic support on windows, this is not full feature parity with the unix cli yet	2023-10-30 16:18:12 -04:00
Matt Williams	1df6100c77	Update examples/langchain-python-rag-privategpt/privateGPT.py Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-10-30 12:48:17 -07:00
Matt Williams	5c48fe1fb0	Update examples/langchain-python-rag-privategpt/constants.py Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-10-30 12:47:56 -07:00
Dirk Loss	874bb31986	Fix conversion command for gptneox (#948)	2023-10-30 14:34:29 -04:00
Matt Williams	f7856a57eb	fix: private gpt example was broken due to changes in chroma Signed-off-by: Matt Williams <m@technovangelist.com>	2023-10-30 10:56:25 -07:00
Bruce MacDonald	f9a4281124	clean up: remove server functions from client (#937)	2023-10-30 11:10:18 -04:00
Timothy Jaeryang Baek	96da0792e6	doc: OllamaSharp for .NET moved to libraries	2023-10-28 16:18:38 -05:00
Timothy Jaeryang Baek	95d24262fc	doc: categorised community integrations + added web-ui	2023-10-28 16:02:13 -05:00
Jeffrey Morgan	8d03bd7b54	remove `+build` directive in `term.go`	2023-10-28 09:56:03 -07:00
Jeffrey Morgan	9ec16f0f03	fix formatting when exiting `ollama run`	2023-10-27 21:26:23 -07:00
Jeffrey Morgan	57a58db1b0	history: update pos after compact	2023-10-27 20:38:03 -07:00
Jeffrey Morgan	2d75a4537c	close input channel when receiving `io.EOF`	2023-10-27 20:26:04 -07:00
Jeffrey Morgan	4748609611	Don't quit ioloop on `NUL` character (#940) * dont quit ioloop on 0 rune * check for closed channel * remove unused error on `Close()`	2023-10-27 20:01:48 -07:00
Jeffrey Morgan	c0dcea1398	Update faq.md	2023-10-27 18:29:00 -07:00
Michael Yang	115fc56eb7	calculate and verify md5 checksum	2023-10-27 17:07:33 -07:00
Michael Yang	186f685224	retry PUT	2023-10-27 17:07:33 -07:00
Michael Yang	12efcbb057	comments	2023-10-27 17:07:33 -07:00
Michael Yang	4e09aab8b9	concurrent uploads	2023-10-27 17:07:33 -07:00
Jeffrey Morgan	3a1ed9ff70	restore building runner with `AVX` on by default (#900)	2023-10-27 12:13:44 -07:00
Bruce MacDonald	6d283882b1	catch insufficient permissions nvidia err (#934)	2023-10-27 12:42:40 -04:00
Bruce MacDonald	5c3491f425	allow for a configurable ollama model storage directory (#897) * allow for a configurable ollama models directory - set OLLAMA_MODELS in the environment that ollama is running in to change where model files are stored - update docs Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com> Co-Authored-By: Jay Nakrani <dhananjaynakrani@gmail.com> Co-Authored-By: Akhil Acharya <akhilcacharya@gmail.com> Co-Authored-By: Sasha Devol <sasha.devol@protonmail.com>	2023-10-27 10:19:59 -04:00
James Braza	e5d1ce4dde	Tweaks to `README.md` (#906) * Mentioned Docker Hub in docs * Consolidated brew installs to one line	2023-10-27 00:10:23 -07:00
Bruce MacDonald	2665f3c28e	offload 75% of available vram to improve stability (#921)	2023-10-26 20:49:55 -04:00
Patrick Devine	a79f030e75	add bracketed paste mode (#922)	2023-10-26 15:57:00 -07:00
Michael Yang	9bc5864a03	Merge pull request #918 from jmorganca/mxyng/fix-out-of-space fix(download): no retry when out of space	2023-10-26 12:24:20 -07:00
Michael Yang	b88cc0fac9	Merge pull request #916 from jmorganca/mxyng/fix-client-host fix(client): trim trailing slash	2023-10-26 12:24:12 -07:00
Patrick Devine	5b2cf16397	fix docker build annotations (#917)	2023-10-26 12:00:33 -07:00
Michael Yang	910816a532	fix(download): no retry when out of space	2023-10-26 11:34:07 -07:00
Michael Yang	28c3f288e2	client: fix trailing slash	2023-10-26 11:09:38 -07:00
Patrick Devine	deeac961bb	new readline library (#847)	2023-10-25 16:41:18 -07:00
Jeffrey Morgan	49443e7da5	fix typo in `README.md`	2023-10-25 16:19:27 -07:00
Ajay Kemparaj	bb8464c0d2	update golang.org/x/net fixes CVE-2023-3978, CVE-2023-39325, CVE-2023-44487 (#855)	2023-10-25 16:17:24 -07:00
Michael Yang	daa5bb4473	Merge pull request #907 from jmorganca/mxyng/linux update linux.md	2023-10-25 15:03:34 -07:00
Michael Yang	92119de9d8	update linux.md	2023-10-25 14:57:50 -07:00
Michael Yang	53b0ba8d43	Merge pull request #893 from jmorganca/mxyng/update-faq update faq	2023-10-24 16:02:35 -07:00
Michael Yang	db342691f9	Update docs/faq.md Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-10-24 13:59:33 -07:00
Bruce MacDonald	cecf83141e	Linux uninstall instructions (#894)	2023-10-24 14:07:05 -04:00
Michael Yang	a5a2adf1ec	update faq	2023-10-24 10:54:16 -07:00
Jeffrey Morgan	b0c9cd0f3b	fix metal assertion errors	2023-10-24 00:32:36 -07:00
Jeffrey Morgan	77f61c6301	update submodule commit	2023-10-24 00:30:27 -07:00
Jeffrey Morgan	f3604534e5	update submodule commit	2023-10-23 23:59:12 -07:00
Jeffrey Morgan	914428351a	Update import.md	2023-10-23 17:44:53 -07:00
Jeffrey Morgan	9afea9e3b9	Update import.md Separate GGUF and PyTorch guides	2023-10-23 17:42:17 -07:00
Bruce MacDonald	c039432b5c	add current user to ollama group on install (#772)	2023-10-23 17:06:31 -04:00
Michael Yang	c345b4ca7c	Merge pull request #884 from jmorganca/mxyng/update-submodules bump submodules	2023-10-23 11:27:38 -07:00
Michael Yang	0c7a00a264	bump submodules pin to 9e70cc03229df19ca2d28ce23cc817198f897278 for now since 438c2ca83045a00ef244093d27e9ed41a8cb4ea9 is breaking	2023-10-23 11:17:59 -07:00
Michael Yang	36c160f1c3	Merge pull request #881 from jmorganca/mxyng/ggufv3 ggufv3	2023-10-23 10:50:45 -07:00
Michael Yang	b66bcaa582	Merge pull request #883 from jmorganca/mxyng/logs update default log target	2023-10-23 10:50:29 -07:00
Michael Yang	c9167494cb	update default log target	2023-10-23 10:44:50 -07:00
Michael Yang	125d0a013a	ggufv3 ggufv3 adds support for big endianness, mainly for s390x architecture. while that's not currently supported for ollama, the change is simple. loosen version check to be more forward compatible. unless specified, gguf versions other v1 will be decoded into v2.	2023-10-23 09:35:49 -07:00
Richard Awoyemi	ba2da6ceaa	Added a minimalist React UI for Ollama models to the community contributions.md (#870)	2023-10-23 10:44:39 -04:00