wip /api/chat

2025-12-25 00:30:56 -05:00 · 2023-10-01 14:54:17 -07:00
124 changed files with 1782 additions and 58170 deletions
--- a/6
+++ b/6
@@ -1,16 +1,16 @@
 FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

 ARG TARGETARCH
+ARG VERSION=0.0.0
 ARG GOFLAGS="'-ldflags=-w -s'"

 WORKDIR /go/src/github.com/jmorganca/ollama
 RUN apt-get update && apt-get install -y git build-essential cmake
-ADD https://dl.google.com/go/go1.21.3.linux-$TARGETARCH.tar.gz /tmp/go1.21.3.tar.gz
-RUN mkdir -p /usr/local && tar xz -C /usr/local </tmp/go1.21.3.tar.gz
+ADD https://dl.google.com/go/go1.21.1.linux-$TARGETARCH.tar.gz /tmp/go1.21.1.tar.gz
+RUN mkdir -p /usr/local && tar xz -C /usr/local </tmp/go1.21.1.tar.gz

 COPY . .
 ENV GOARCH=$TARGETARCH
-ENV GOFLAGS=$GOFLAGS
 RUN /usr/local/go/bin/go generate ./... \
    && /usr/local/go/bin/go build .

--- a/Dockerfile.build
+++ b/Dockerfile.build
@@ -1,5 +1,6 @@
+
 # centos7 amd64 dependencies
-FROM --platform=linux/amd64 nvidia/cuda:11.3.1-devel-centos7 AS base-amd64
+FROM --platform=linux/amd64 nvidia/cuda:11.8.0-devel-centos7 AS base-amd64
 RUN yum install -y https://repo.ius.io/ius-release-el7.rpm centos-release-scl && \
    yum update -y && \
    yum install -y devtoolset-10-gcc devtoolset-10-gcc-c++ git236 wget
@@ -7,25 +8,25 @@ RUN wget "https://github.com/Kitware/CMake/releases/download/v3.27.6/cmake-3.27.
 ENV PATH /opt/rh/devtoolset-10/root/usr/bin:$PATH

 # centos8 arm64 dependencies
-FROM --platform=linux/arm64 nvidia/cuda-arm64:11.3.1-devel-centos8 AS base-arm64
+FROM --platform=linux/arm64 nvidia/cuda:11.4.3-devel-centos8 AS base-arm64
 RUN sed -i -e 's/mirrorlist/#mirrorlist/g' -e 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
 RUN yum install -y git cmake

 FROM base-${TARGETARCH}
 ARG TARGETARCH
-ARG GOFLAGS="'-ldflags -w -s'"

 # install go
-ADD https://dl.google.com/go/go1.21.3.linux-$TARGETARCH.tar.gz /tmp/go1.21.3.tar.gz
-RUN mkdir -p /usr/local && tar xz -C /usr/local </tmp/go1.21.3.tar.gz
+ADD https://dl.google.com/go/go1.21.1.linux-$TARGETARCH.tar.gz /tmp/go1.21.1.tar.gz
+RUN mkdir -p /usr/local && tar xz -C /usr/local </tmp/go1.21.1.tar.gz

 # build the final binary
 WORKDIR /go/src/github.com/jmorganca/ollama
 COPY . .
-
 ENV GOOS=linux
 ENV GOARCH=$TARGETARCH
-ENV GOFLAGS=$GOFLAGS
+
+ARG VERSION=0.0.0
+ARG GOFLAGS="'-ldflags -w -s'"

 RUN /usr/local/go/bin/go generate ./... && \
    /usr/local/go/bin/go build .
--- a/README.md
+++ b/README.md
@@ -13,11 +13,7 @@ Get up and running with large language models locally.

 ### macOS

-[Download](https://ollama.ai/download/Ollama-darwin.zip)
-
-### Windows
-
-Coming soon!
+[Download](https://ollama.ai/download/Ollama-darwin.zip) 

 ### Linux & WSL2

@@ -27,9 +23,9 @@ curl https://ollama.ai/install.sh | sh

 [Manual install instructions](https://github.com/jmorganca/ollama/blob/main/docs/linux.md)

-### Docker
+### Windows 

-See the official [Docker image](https://hub.docker.com/r/ollama/ollama).
+Coming soon!

 ## Quickstart

@@ -41,7 +37,7 @@ ollama run llama2

 ## Model library

-Ollama supports a list of open-source models available on [ollama.ai/library](https://ollama.ai/library 'ollama model library')
+Ollama supports a list of open-source models available on [ollama.ai/library](https://ollama.ai/library "ollama model library")

 Here are some example open-source models that can be downloaded:

@@ -60,32 +56,28 @@ Here are some example open-source models that can be downloaded:

 ## Customize your own model

-### Import from GGUF
+### Import from GGUF or GGML

-Ollama supports importing GGUF models in the Modelfile:
+Ollama supports importing GGUF and GGML file formats in the Modelfile. This means if you have a model that is not in the Ollama library, you can create it, iterate on it, and upload it to the Ollama library to share with others when you are ready.

-1. Create a file named `Modelfile`, with a `FROM` instruction with the local filepath to the model you want to import.
+1. Create a file named Modelfile, and add a `FROM` instruction with the local filepath to the model you want to import.

   ```
   FROM ./vicuna-33b.Q4_0.gguf
   ```

-2. Create the model in Ollama
+2. Create the model in Ollama

   ```
-   ollama create example -f Modelfile
+   ollama create name -f path_to_modelfile
   ```

-3. Run the model
+3. Run the model

   ```
-   ollama run example
+   ollama run name
   ```

-### Import from PyTorch or Safetensors
-
-See the [guide](docs/import.md) on importing models for more information.
-
 ### Customize a prompt

 Models from the Ollama library can be customized with a prompt. The example
@@ -117,7 +109,7 @@ ollama run mario
 Hello! It's your friend Mario.
 ```

-For more examples, see the [examples](examples) directory. For more information on working with a Modelfile, see the [Modelfile](docs/modelfile.md) documentation.
+For more examples, see the [examples](./examples) directory. For more information on working with a Modelfile, see the [Modelfile](./docs/modelfile.md) documentation.

 ## CLI Reference

@@ -203,7 +195,7 @@ Finally, in a separate shell, run a model:

 ## REST API

-See the [API documentation](docs/api.md) for all endpoints.
+> See the [API documentation](./docs/api.md) for all endpoints.

 Ollama has an API for running and managing models. For example to generate text from a model:

@@ -225,11 +217,7 @@ curl -X POST http://localhost:11434/api/generate -d '{
 - [Dagger Chatbot](https://github.com/samalba/dagger-chatbot)
 - [LiteLLM](https://github.com/BerriAI/litellm)
 - [Discord AI Bot](https://github.com/mekb-turtle/discord-ai-bot)
-- [Chatbot UI](https://github.com/ivanfioravanti/chatbot-ollama)
 - [HTML UI](https://github.com/rtcfirefly/ollama-ui)
 - [Typescript UI](https://github.com/ollama-interface/Ollama-Gui?tab=readme-ov-file)
 - [Dumbar](https://github.com/JerrySievert/Dumbar)
 - [Emacs client](https://github.com/zweifisch/ollama)
-- [oterm](https://github.com/ggozad/oterm)
-- [Ellama Emacs client](https://github.com/s-kostyaev/ellama)
-- [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)
--- a/api/client.go
+++ b/api/client.go
@@ -7,24 +7,25 @@ import (
 	"encoding/json"
 	"fmt"
 	"io"
-	"net"
 	"net/http"
 	"net/url"
 	"os"
 	"runtime"
 	"strings"

-	"github.com/jmorganca/ollama/format"
 	"github.com/jmorganca/ollama/version"
 )

 const DefaultHost = "127.0.0.1:11434"

-var envHost = os.Getenv("OLLAMA_HOST")
+var (
+	envHost = os.Getenv("OLLAMA_HOST")
+)

 type Client struct {
-	base *url.URL
-	http http.Client
+	Base    url.URL
+	HTTP    http.Client
+	Headers http.Header
 }

 func checkError(resp *http.Response, body []byte) error {
@@ -43,46 +44,34 @@ func checkError(resp *http.Response, body []byte) error {
 	return apiError
 }

-func ClientFromEnvironment() (*Client, error) {
-	scheme, hostport, ok := strings.Cut(os.Getenv("OLLAMA_HOST"), "://")
-	if !ok {
-		scheme, hostport = "http", os.Getenv("OLLAMA_HOST")
+// Host returns the default host to use for the client. It is determined in the following order:
+// 1. The OLLAMA_HOST environment variable
+// 2. The default host (127.0.0.1:11434)
+func Host() string {
+	if envHost != "" {
+		return envHost
+	}
+	return DefaultHost
+}
+
+// FromEnv creates a new client using Host() as the host. An error is returned
+// if the host is invalid.
+func FromEnv() (*Client, error) {
+	h := Host()
+	if !strings.HasPrefix(h, "http://") && !strings.HasPrefix(h, "https://") {
+		h = "http://" + h
 	}

-	host, port, err := net.SplitHostPort(hostport)
+	u, err := url.Parse(h)
 	if err != nil {
-		host, port = "127.0.0.1", "11434"
-		if ip := net.ParseIP(strings.Trim(hostport, "[]")); ip != nil {
-			host = ip.String()
-		} else if hostport != "" {
-			host = hostport
-		}
+		return nil, fmt.Errorf("could not parse host: %w", err)
 	}

-	client := Client{
-		base: &url.URL{
-			Scheme: scheme,
-			Host:   net.JoinHostPort(host, port),
-		},
+	if u.Port() == "" {
+		u.Host += ":11434"
 	}

-	mockRequest, err := http.NewRequest("HEAD", client.base.String(), nil)
-	if err != nil {
-		return nil, err
-	}
-
-	proxyURL, err := http.ProxyFromEnvironment(mockRequest)
-	if err != nil {
-		return nil, err
-	}
-
-	client.http = http.Client{
-		Transport: &http.Transport{
-			Proxy: http.ProxyURL(proxyURL),
-		},
-	}
-
-	return &client, nil
+	return &Client{Base: *u, HTTP: http.Client{}}, nil
 }

 func (c *Client) do(ctx context.Context, method, path string, reqData, respData any) error {
@@ -97,7 +86,7 @@ func (c *Client) do(ctx context.Context, method, path string, reqData, respData
 		reqBody = bytes.NewReader(data)
 	}

-	requestURL := c.base.JoinPath(path)
+	requestURL := c.Base.JoinPath(path)
 	request, err := http.NewRequestWithContext(ctx, method, requestURL.String(), reqBody)
 	if err != nil {
 		return err
@@ -107,7 +96,11 @@ func (c *Client) do(ctx context.Context, method, path string, reqData, respData
 	request.Header.Set("Accept", "application/json")
 	request.Header.Set("User-Agent", fmt.Sprintf("ollama/%s (%s %s) Go/%s", version.Version, runtime.GOARCH, runtime.GOOS, runtime.Version()))

-	respObj, err := c.http.Do(request)
+	for k, v := range c.Headers {
+		request.Header[k] = v
+	}
+
+	respObj, err := c.HTTP.Do(request)
 	if err != nil {
 		return err
 	}
@@ -130,8 +123,6 @@ func (c *Client) do(ctx context.Context, method, path string, reqData, respData
 	return nil
 }

-const maxBufferSize = 512 * format.KiloByte
-
 func (c *Client) stream(ctx context.Context, method, path string, data any, fn func([]byte) error) error {
 	var buf *bytes.Buffer
 	if data != nil {
@@ -143,26 +134,23 @@ func (c *Client) stream(ctx context.Context, method, path string, data any, fn f
 		buf = bytes.NewBuffer(bts)
 	}

-	requestURL := c.base.JoinPath(path)
+	requestURL := c.Base.JoinPath(path)
 	request, err := http.NewRequestWithContext(ctx, method, requestURL.String(), buf)
 	if err != nil {
 		return err
 	}

 	request.Header.Set("Content-Type", "application/json")
-	request.Header.Set("Accept", "application/x-ndjson")
+	request.Header.Set("Accept", "application/json")
 	request.Header.Set("User-Agent", fmt.Sprintf("ollama/%s (%s %s) Go/%s", version.Version, runtime.GOARCH, runtime.GOOS, runtime.Version()))

-	response, err := c.http.Do(request)
+	response, err := http.DefaultClient.Do(request)
 	if err != nil {
 		return err
 	}
 	defer response.Body.Close()

 	scanner := bufio.NewScanner(response.Body)
-	// increase the buffer size to avoid running out of space
-	scanBuf := make([]byte, 0, maxBufferSize)
-	scanner.Buffer(scanBuf, maxBufferSize)
 	for scanner.Scan() {
 		var errorResponse struct {
 			Error string `json:"error,omitempty"`
--- a/api/types.go
+++ b/api/types.go
@@ -3,6 +3,7 @@ package api
 import (
 	"encoding/json"
 	"fmt"
+	"log"
 	"math"
 	"os"
 	"reflect"
@@ -30,13 +31,28 @@ func (e StatusError) Error() string {
 	}
 }

+// /api/chat
+type Message struct {
+	Role    string `json:"role"`
+	Content string `json:"content"`
+}
+
+type ChatRequest struct {
+	Model    string    `json:"model"`
+	Messages []Message `json:"messages"`
+}
+
+type ChatResponse struct {
+	CreatedAt time.Time `json:"created_at"`
+	Message   Message   `json:"message"`
+}
+
 type GenerateRequest struct {
 	Model    string `json:"model"`
 	Prompt   string `json:"prompt"`
 	System   string `json:"system"`
 	Template string `json:"template"`
 	Context  []int  `json:"context,omitempty"`
-	Stream   *bool  `json:"stream,omitempty"`

 	Options map[string]interface{} `json:"options"`
 }
@@ -53,9 +69,8 @@ type EmbeddingResponse struct {
 }

 type CreateRequest struct {
-	Name   string `json:"name"`
-	Path   string `json:"path"`
-	Stream *bool  `json:"stream,omitempty"`
+	Name string `json:"name"`
+	Path string `json:"path"`
 }

 type DeleteRequest struct {
@@ -82,9 +97,6 @@ type CopyRequest struct {
 type PullRequest struct {
 	Name     string `json:"name"`
 	Insecure bool   `json:"insecure,omitempty"`
-	Username string `json:"username"`
-	Password string `json:"password"`
-	Stream   *bool  `json:"stream,omitempty"`
 }

 type ProgressResponse struct {
@@ -97,9 +109,6 @@ type ProgressResponse struct {
 type PushRequest struct {
 	Name     string `json:"name"`
 	Insecure bool   `json:"insecure,omitempty"`
-	Username string `json:"username"`
-	Password string `json:"password"`
-	Stream   *bool  `json:"stream,omitempty"`
 }

 type ListResponse struct {
@@ -120,7 +129,7 @@ type TokenResponse struct {
 type GenerateResponse struct {
 	Model     string    `json:"model"`
 	CreatedAt time.Time `json:"created_at"`
-	Response  string    `json:"response"`
+	Response  string    `json:"response,omitempty"`

 	Done    bool  `json:"done"`
 	Context []int `json:"context,omitempty"`
@@ -161,10 +170,15 @@ func (r *GenerateResponse) Summary() {
 	}
 }

-// Runner options which must be set when the model is loaded into memory
-type Runner struct {
-	UseNUMA            bool    `json:"numa,omitempty"`
+type Options struct {
+	Seed int `json:"seed,omitempty"`
+
+	// Backend options
+	UseNUMA bool `json:"numa,omitempty"`
+
+	// Model options
 	NumCtx             int     `json:"num_ctx,omitempty"`
+	NumKeep            int     `json:"num_keep,omitempty"`
 	NumBatch           int     `json:"num_batch,omitempty"`
 	NumGQA             int     `json:"num_gqa,omitempty"`
 	NumGPU             int     `json:"num_gpu,omitempty"`
@@ -178,15 +192,8 @@ type Runner struct {
 	EmbeddingOnly      bool    `json:"embedding_only,omitempty"`
 	RopeFrequencyBase  float32 `json:"rope_frequency_base,omitempty"`
 	RopeFrequencyScale float32 `json:"rope_frequency_scale,omitempty"`
-	NumThread          int     `json:"num_thread,omitempty"`
-}

-type Options struct {
-	Runner
-
-	// Predict options used at runtime
-	NumKeep          int      `json:"num_keep,omitempty"`
-	Seed             int      `json:"seed,omitempty"`
+	// Predict options
 	NumPredict       int      `json:"num_predict,omitempty"`
 	TopK             int      `json:"top_k,omitempty"`
 	TopP             float32  `json:"top_p,omitempty"`
@@ -202,9 +209,9 @@ type Options struct {
 	MirostatEta      float32  `json:"mirostat_eta,omitempty"`
 	PenalizeNewline  bool     `json:"penalize_newline,omitempty"`
 	Stop             []string `json:"stop,omitempty"`
-}

-var ErrInvalidOpts = fmt.Errorf("invalid options")
+	NumThread int `json:"num_thread,omitempty"`
+}

 func (opts *Options) FromMap(m map[string]interface{}) error {
 	valueOpts := reflect.ValueOf(opts).Elem() // names of the fields in the options struct
@@ -219,7 +226,6 @@ func (opts *Options) FromMap(m map[string]interface{}) error {
 		}
 	}

-	invalidOpts := []string{}
 	for key, val := range m {
 		if opt, ok := jsonOpts[key]; ok {
 			field := valueOpts.FieldByName(opt.Name)
@@ -237,39 +243,44 @@ func (opts *Options) FromMap(m map[string]interface{}) error {
 						// when JSON unmarshals numbers, it uses float64, not int
 						field.SetInt(int64(t))
 					default:
-						return fmt.Errorf("option %q must be of type integer", key)
+						log.Printf("could not convert model parameter %v to int, skipped", key)
 					}
 				case reflect.Bool:
 					val, ok := val.(bool)
 					if !ok {
-						return fmt.Errorf("option %q must be of type boolean", key)
+						log.Printf("could not convert model parameter %v to bool, skipped", key)
+						continue
 					}
 					field.SetBool(val)
 				case reflect.Float32:
 					// JSON unmarshals to float64
 					val, ok := val.(float64)
 					if !ok {
-						return fmt.Errorf("option %q must be of type float32", key)
+						log.Printf("could not convert model parameter %v to float32, skipped", key)
+						continue
 					}
 					field.SetFloat(val)
 				case reflect.String:
 					val, ok := val.(string)
 					if !ok {
-						return fmt.Errorf("option %q must be of type string", key)
+						log.Printf("could not convert model parameter %v to string, skipped", key)
+						continue
 					}
 					field.SetString(val)
 				case reflect.Slice:
 					// JSON unmarshals to []interface{}, not []string
 					val, ok := val.([]interface{})
 					if !ok {
-						return fmt.Errorf("option %q must be of type array", key)
+						log.Printf("could not convert model parameter %v to slice, skipped", key)
+						continue
 					}
 					// convert []interface{} to []string
 					slice := make([]string, len(val))
 					for i, item := range val {
 						str, ok := item.(string)
 						if !ok {
-							return fmt.Errorf("option %q must be of an array of strings", key)
+							log.Printf("could not convert model parameter %v to slice of strings, skipped", key)
+							continue
 						}
 						slice[i] = str
 					}
@@ -278,53 +289,45 @@ func (opts *Options) FromMap(m map[string]interface{}) error {
 					return fmt.Errorf("unknown type loading config params: %v", field.Kind())
 				}
 			}
-		} else {
-			invalidOpts = append(invalidOpts, key)
 		}
 	}
-
-	if len(invalidOpts) > 0 {
-		return fmt.Errorf("%w: %v", ErrInvalidOpts, strings.Join(invalidOpts, ", "))
-	}
 	return nil
 }

 func DefaultOptions() Options {
 	return Options{
-		// options set on request to runner
-		NumPredict:       -1,
-		NumKeep:          -1,
+		Seed: -1,
+
+		UseNUMA: false,
+
+		NumCtx:             2048,
+		NumKeep:            -1,
+		NumBatch:           512,
+		NumGPU:             -1, // -1 here indicates that NumGPU should be set dynamically
+		NumGQA:             1,
+		LowVRAM:            false,
+		F16KV:              true,
+		UseMMap:            true,
+		UseMLock:           false,
+		RopeFrequencyBase:  10000.0,
+		RopeFrequencyScale: 1.0,
+		EmbeddingOnly:      true,
+
+		RepeatLastN:      64,
+		RepeatPenalty:    1.1,
+		FrequencyPenalty: 0.0,
+		PresencePenalty:  0.0,
 		Temperature:      0.8,
 		TopK:             40,
 		TopP:             0.9,
 		TFSZ:             1.0,
 		TypicalP:         1.0,
-		RepeatLastN:      64,
-		RepeatPenalty:    1.1,
-		PresencePenalty:  0.0,
-		FrequencyPenalty: 0.0,
 		Mirostat:         0,
 		MirostatTau:      5.0,
 		MirostatEta:      0.1,
 		PenalizeNewline:  true,
-		Seed:             -1,

-		Runner: Runner{
-			// options set when the model is loaded
-			NumCtx:             2048,
-			RopeFrequencyBase:  10000.0,
-			RopeFrequencyScale: 1.0,
-			NumBatch:           512,
-			NumGPU:             -1, // -1 here indicates that NumGPU should be set dynamically
-			NumGQA:             1,
-			NumThread:          0, // let the runtime decide
-			LowVRAM:            false,
-			F16KV:              true,
-			UseMLock:           false,
-			UseMMap:            true,
-			UseNUMA:            false,
-			EmbeddingOnly:      true,
-		},
+		NumThread: 0, // let the runtime decide
 	}
 }

--- a/app/forge.config.ts
+++ b/app/forge.config.ts
@@ -47,6 +47,16 @@ const config: ForgeConfig = {
  },
  rebuildConfig: {},
  makers: [new MakerSquirrel({}), new MakerZIP({}, ['darwin'])],
+  publishers: [
+    new PublisherGithub({
+      repository: {
+        name: 'ollama',
+        owner: 'jmorganca',
+      },
+      draft: false,
+      prerelease: true,
+    }),
+  ],
  hooks: {
    readPackageJson: async (_, packageJson) => {
      return { ...packageJson, version: process.env.VERSION || packageJson.version }
--- a/app/package-lock.json
+++ b/app/package-lock.json
--- a/app/package.json
+++ b/app/package.json
@@ -46,7 +46,7 @@
    "chmodr": "^1.2.0",
    "copy-webpack-plugin": "^11.0.0",
    "css-loader": "^6.8.1",
-    "electron": "25.9.2",
+    "electron": "25.2.0",
    "eslint": "^8.43.0",
    "eslint-plugin-import": "^2.27.5",
    "fork-ts-checker-webpack-plugin": "^7.3.0",
--- a/app/src/index.ts
+++ b/app/src/index.ts
@@ -162,56 +162,13 @@ app.on('before-quit', () => {
  }
 })

-const updateURL = `https://ollama.ai/api/update?os=${process.platform}&arch=${
-  process.arch
-}&version=${app.getVersion()}&id=${id()}`
-
-let latest = ''
-async function isNewReleaseAvailable() {
-  try {
-    const response = await fetch(updateURL)
-
-    if (!response.ok) {
-      return false
-    }
-
-    if (response.status === 204) {
-      return false
-    }
-
-    const data = await response.json()
-
-    const url = data?.url
-    if (!url) {
-      return false
-    }
-
-    if (latest === url) {
-      return false
-    }
-
-    latest = url
-
-    return true
-  } catch (error) {
-    logger.error(`update check failed - ${error}`)
-    return false
-  }
-}
-
-async function checkUpdate() {
-  const available = await isNewReleaseAvailable()
-  if (available) {
-    logger.info('checking for update')
-    autoUpdater.checkForUpdates()
-  }
-}
-
 function init() {
  if (app.isPackaged) {
-    checkUpdate()
+    autoUpdater.checkForUpdates()
    setInterval(() => {
-      checkUpdate()
+      if (!updateAvailable) {
+        autoUpdater.checkForUpdates()
+      }
    }, 60 * 60 * 1000)
  }

@@ -289,7 +246,11 @@ function id(): string {
  return uuid
 }

-autoUpdater.setFeedURL({ url: updateURL })
+autoUpdater.setFeedURL({
+  url: `https://ollama.ai/api/update?os=${process.platform}&arch=${
+    process.arch
+  }&version=${app.getVersion()}&id=${id()}`,
+})

 autoUpdater.on('error', e => {
  logger.error(`update check failed - ${e.message}`)
--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -61,7 +61,7 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 		return err
 	}

-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -78,12 +78,18 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 				spinner.Stop()
 			}
 			currentDigest = resp.Digest
-			// pulling
-			bar = progressbar.DefaultBytes(
-				resp.Total,
-				resp.Status,
-			)
-			bar.Set64(resp.Completed)
+			switch {
+			case strings.Contains(resp.Status, "embeddings"):
+				bar = progressbar.Default(resp.Total, resp.Status)
+				bar.Set64(resp.Completed)
+			default:
+				// pulling
+				bar = progressbar.DefaultBytes(
+					resp.Total,
+					resp.Status,
+				)
+				bar.Set64(resp.Completed)
+			}
 		} else if resp.Digest == currentDigest && resp.Digest != "" {
 			bar.Set64(resp.Completed)
 		} else {
@@ -113,7 +119,7 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 }

 func RunHandler(cmd *cobra.Command, args []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -138,7 +144,7 @@ func RunHandler(cmd *cobra.Command, args []string) error {
 }

 func PushHandler(cmd *cobra.Command, args []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -182,7 +188,7 @@ func PushHandler(cmd *cobra.Command, args []string) error {
 }

 func ListHandler(cmd *cobra.Command, args []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -215,7 +221,7 @@ func ListHandler(cmd *cobra.Command, args []string) error {
 }

 func DeleteHandler(cmd *cobra.Command, args []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -231,7 +237,7 @@ func DeleteHandler(cmd *cobra.Command, args []string) error {
 }

 func ShowHandler(cmd *cobra.Command, args []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -309,7 +315,7 @@ func ShowHandler(cmd *cobra.Command, args []string) error {
 }

 func CopyHandler(cmd *cobra.Command, args []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -332,7 +338,7 @@ func PullHandler(cmd *cobra.Command, args []string) error {
 }

 func pull(model string, insecure bool) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -374,20 +380,7 @@ func pull(model string, insecure bool) error {
 func RunGenerate(cmd *cobra.Command, args []string) error {
 	if len(args) > 1 {
 		// join all args into a single prompt
-		wordWrap := false
-		if term.IsTerminal(int(os.Stdout.Fd())) {
-			wordWrap = true
-		}
-
-		nowrap, err := cmd.Flags().GetBool("nowordwrap")
-		if err != nil {
-			return err
-		}
-		if nowrap {
-			wordWrap = false
-		}
-
-		return generate(cmd, args[0], strings.Join(args[1:], " "), wordWrap)
+		return generate(cmd, args[0], strings.Join(args[1:], " "))
 	}

 	if readline.IsTerminal(int(os.Stdin.Fd())) {
@@ -399,8 +392,8 @@ func RunGenerate(cmd *cobra.Command, args []string) error {

 type generateContextKey string

-func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
-	client, err := api.ClientFromEnvironment()
+func generate(cmd *cobra.Command, model, prompt string) error {
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -415,9 +408,24 @@ func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
 		generateContext = []int{}
 	}

+	var wrapTerm bool
+	termType := os.Getenv("TERM")
+	if termType == "xterm-256color" {
+		wrapTerm = true
+	}
+
 	termWidth, _, err := term.GetSize(int(0))
 	if err != nil {
-		wordWrap = false
+		wrapTerm = false
+	}
+
+	// override wrapping if the user turned it off
+	nowrap, err := cmd.Flags().GetBool("nowordwrap")
+	if err != nil {
+		return err
+	}
+	if nowrap {
+		wrapTerm = false
 	}

 	cancelCtx, cancel := context.WithCancel(context.Background())
@@ -444,7 +452,7 @@ func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {

 		latest = response

-		if wordWrap {
+		if wrapTerm {
 			for _, ch := range response.Response {
 				if currentLineLength+1 > termWidth-5 {
 					// backtrack the length of the last word and clear to the end of the line
@@ -473,7 +481,18 @@ func generate(cmd *cobra.Command, model, prompt string, wordWrap bool) error {
 	}

 	if err := client.Generate(cancelCtx, &request, fn); err != nil {
-		if strings.Contains(err.Error(), "context canceled") && abort {
+		if strings.Contains(err.Error(), "failed to load model") {
+			// tell the user to check the server log, if it exists locally
+			home, nestedErr := os.UserHomeDir()
+			if nestedErr != nil {
+				// return the original error
+				return err
+			}
+			logPath := filepath.Join(home, ".ollama", "logs", "server.log")
+			if _, nestedErr := os.Stat(logPath); nestedErr == nil {
+				err = fmt.Errorf("%w\nFor more details, check the error logs at %s", err, logPath)
+			}
+		} else if strings.Contains(err.Error(), "context canceled") && abort {
 			spinner.Finish()
 			return nil
 		}
@@ -514,7 +533,7 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 	}

 	// load the model
-	if err := generate(cmd, model, "", false); err != nil {
+	if err := generate(cmd, model, ""); err != nil {
 		return err
 	}

@@ -541,35 +560,8 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 	)

 	usage := func() {
-		fmt.Fprintln(os.Stderr, "Available Commands:")
-		fmt.Fprintln(os.Stderr, "  /set         Set session variables")
-		fmt.Fprintln(os.Stderr, "  /show        Show model information")
-		fmt.Fprintln(os.Stderr, "  /bye         Exit")
-		fmt.Fprintln(os.Stderr, "  /?, /help    Help for a command")
-		fmt.Fprintln(os.Stderr, "")
-		fmt.Fprintln(os.Stderr, "Use \"\"\" to begin a multi-line message.")
-		fmt.Fprintln(os.Stderr, "")
-	}
-
-	usageSet := func() {
-		fmt.Fprintln(os.Stderr, "Available Commands:")
-		fmt.Fprintln(os.Stderr, "  /set history      Enable history")
-		fmt.Fprintln(os.Stderr, "  /set nohistory    Disable history")
-		fmt.Fprintln(os.Stderr, "  /set wordwrap     Enable wordwrap")
-		fmt.Fprintln(os.Stderr, "  /set nowordwrap   Disable wordwrap")
-		fmt.Fprintln(os.Stderr, "  /set verbose      Show LLM stats")
-		fmt.Fprintln(os.Stderr, "  /set quiet        Disable LLM stats")
-		fmt.Fprintln(os.Stderr, "")
-	}
-
-	usageShow := func() {
-		fmt.Fprintln(os.Stderr, "Available Commands:")
-		fmt.Fprintln(os.Stderr, "  /show license      Show model license")
-		fmt.Fprintln(os.Stderr, "  /show modelfile    Show Modelfile for this model")
-		fmt.Fprintln(os.Stderr, "  /show parameters   Show parameters for this model")
-		fmt.Fprintln(os.Stderr, "  /show system       Show system prompt")
-		fmt.Fprintln(os.Stderr, "  /show template     Show prompt template")
-		fmt.Fprintln(os.Stderr, "")
+		fmt.Fprintln(os.Stderr, "commands:")
+		fmt.Fprintln(os.Stderr, completer.Tree("  "))
 	}

 	var painter Painter
@@ -587,21 +579,6 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 	}
 	defer scanner.Close()

-	var wordWrap bool
-	termType := os.Getenv("TERM")
-	if termType == "xterm-256color" {
-		wordWrap = true
-	}
-
-	// override wrapping if the user turned it off
-	nowrap, err := cmd.Flags().GetBool("nowordwrap")
-	if err != nil {
-		return err
-	}
-	if nowrap {
-		wordWrap = false
-	}
-
 	var multiLineBuffer string
 	var isMultiLine bool

@@ -655,10 +632,10 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 				case "nohistory":
 					scanner.HistoryDisable()
 				case "wordwrap":
-					wordWrap = true
+					cmd.Flags().Set("nowordwrap", "false")
 					fmt.Println("Set 'wordwrap' mode.")
 				case "nowordwrap":
-					wordWrap = false
+					cmd.Flags().Set("nowordwrap", "true")
 					fmt.Println("Set 'nowordwrap' mode.")
 				case "verbose":
 					cmd.Flags().Set("verbose", "true")
@@ -683,17 +660,12 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 					fmt.Printf("Unknown command '/set %s'. Type /? for help\n", args[1])
 				}
 			} else {
-				usageSet()
+				usage()
 			}
 		case strings.HasPrefix(line, "/show"):
 			args := strings.Fields(line)
 			if len(args) > 1 {
-				client, err := api.ClientFromEnvironment()
-				if err != nil {
-					fmt.Println("error: couldn't connect to ollama server")
-					return err
-				}
-				resp, err := client.Show(cmd.Context(), &api.ShowRequest{Name: model})
+				resp, err := server.GetModelInfo(model)
 				if err != nil {
 					fmt.Println("error: couldn't get model")
 					return err
@@ -701,49 +673,23 @@ func generateInteractive(cmd *cobra.Command, model string) error {

 				switch args[1] {
 				case "license":
-					if resp.License == "" {
-						fmt.Print("No license was specified for this model.\n\n")
-					} else {
-						fmt.Println(resp.License)
-					}
+					fmt.Println(resp.License)
 				case "modelfile":
 					fmt.Println(resp.Modelfile)
 				case "parameters":
-					if resp.Parameters == "" {
-						fmt.Print("No parameters were specified for this model.\n\n")
-					} else {
-						fmt.Println(resp.Parameters)
-					}
+					fmt.Println(resp.Parameters)
 				case "system":
-					if resp.System == "" {
-						fmt.Print("No system prompt was specified for this model.\n\n")
-					} else {
-						fmt.Println(resp.System)
-					}
+					fmt.Println(resp.System)
 				case "template":
-					if resp.Template == "" {
-						fmt.Print("No prompt template was specified for this model.\n\n")
-					} else {
-						fmt.Println(resp.Template)
-					}
+					fmt.Println(resp.Template)
 				default:
 					fmt.Printf("Unknown command '/show %s'. Type /? for help\n", args[1])
 				}
-			} else {
-				usageShow()
-			}
-		case strings.HasPrefix(line, "/help"), strings.HasPrefix(line, "/?"):
-			args := strings.Fields(line)
-			if len(args) > 1 {
-				switch args[1] {
-				case "set", "/set":
-					usageSet()
-				case "show", "/show":
-					usageShow()
-				}
 			} else {
 				usage()
 			}
+		case line == "/help", line == "/?":
+			usage()
 		case line == "/exit", line == "/bye":
 			return nil
 		case strings.HasPrefix(line, "/"):
@@ -752,7 +698,7 @@ func generateInteractive(cmd *cobra.Command, model string) error {
 		}

 		if len(line) > 0 && line[0] != '/' {
-			if err := generate(cmd, model, line, wordWrap); err != nil {
+			if err := generate(cmd, model, line); err != nil {
 				return err
 			}
 		}
@@ -764,7 +710,7 @@ func generateBatch(cmd *cobra.Command, model string) error {
 	for scanner.Scan() {
 		prompt := scanner.Text()
 		fmt.Printf(">>> %s\n", prompt)
-		if err := generate(cmd, model, prompt, false); err != nil {
+		if err := generate(cmd, model, prompt); err != nil {
 			return err
 		}
 	}
@@ -894,7 +840,7 @@ func startMacApp(client *api.Client) error {
 }

 func checkServerHeartbeat(_ *cobra.Command, _ []string) error {
-	client, err := api.ClientFromEnvironment()
+	client, err := api.FromEnv()
 	if err != nil {
 		return err
 	}
@@ -932,7 +878,7 @@ func NewCLI() *cobra.Command {
 	createCmd := &cobra.Command{
 		Use:     "create MODEL",
 		Short:   "Create a model from a Modelfile",
-		Args:    cobra.ExactArgs(1),
+		Args:    cobra.MinimumNArgs(1),
 		PreRunE: checkServerHeartbeat,
 		RunE:    CreateHandler,
 	}
@@ -942,7 +888,7 @@ func NewCLI() *cobra.Command {
 	showCmd := &cobra.Command{
 		Use:     "show MODEL",
 		Short:   "Show information for a model",
-		Args:    cobra.ExactArgs(1),
+		Args:    cobra.MinimumNArgs(1),
 		PreRunE: checkServerHeartbeat,
 		RunE:    ShowHandler,
 	}
@@ -969,14 +915,13 @@ func NewCLI() *cobra.Command {
 		Use:     "serve",
 		Aliases: []string{"start"},
 		Short:   "Start ollama",
-		Args:    cobra.ExactArgs(0),
 		RunE:    RunServer,
 	}

 	pullCmd := &cobra.Command{
 		Use:     "pull MODEL",
 		Short:   "Pull a model from a registry",
-		Args:    cobra.ExactArgs(1),
+		Args:    cobra.MinimumNArgs(1),
 		PreRunE: checkServerHeartbeat,
 		RunE:    PullHandler,
 	}
@@ -986,7 +931,7 @@ func NewCLI() *cobra.Command {
 	pushCmd := &cobra.Command{
 		Use:     "push MODEL",
 		Short:   "Push a model to a registry",
-		Args:    cobra.ExactArgs(1),
+		Args:    cobra.MinimumNArgs(1),
 		PreRunE: checkServerHeartbeat,
 		RunE:    PushHandler,
 	}
@@ -1002,15 +947,15 @@ func NewCLI() *cobra.Command {
 	}

 	copyCmd := &cobra.Command{
-		Use:     "cp SOURCE TARGET",
+		Use:     "cp",
 		Short:   "Copy a model",
-		Args:    cobra.ExactArgs(2),
+		Args:    cobra.MinimumNArgs(2),
 		PreRunE: checkServerHeartbeat,
 		RunE:    CopyHandler,
 	}

 	deleteCmd := &cobra.Command{
-		Use:     "rm MODEL [MODEL...]",
+		Use:     "rm",
 		Short:   "Remove a model",
 		Args:    cobra.MinimumNArgs(1),
 		PreRunE: checkServerHeartbeat,
--- a/docs/api.md
+++ b/docs/api.md
@@ -12,6 +12,7 @@
 - [Push a Model](#push-a-model)
 - [Generate Embeddings](#generate-embeddings)

+
 ## Conventions

 ### Model names
@@ -39,13 +40,12 @@ Generate a response for a given prompt with a provided model. This is a streamin
 - `model`: (required) the [model name](#model-names)
 - `prompt`: the prompt to generate a response for

-Advanced parameters (optional):
+Advanced parameters:

 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
 - `system`: system prompt to (overrides what is defined in the `Modelfile`)
 - `template`: the full prompt or prompt template (overrides what is defined in the `Modelfile`)
 - `context`: the context parameter returned from a previous request to `/generate`, this can be used to keep a short conversational memory
- `stream`: if `false` the response will be be returned as a single response object, rather than a stream of objects

 ### Request

@@ -80,7 +80,6 @@ The final response in the stream also includes additional data about the generat
 - `eval_count`: number of tokens the response
 - `eval_duration`: time in nanoseconds spent generating the response
 - `context`: an encoding of the conversation used in this response, this can be sent in the next request to keep a conversational memory
- `response`: empty if the response was streamed, if not streamed, this will contain the full response

 To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` / `eval_duration`.

@@ -88,7 +87,6 @@ To calculate how fast the response is generated in tokens per second (token/s),
 {
  "model": "llama2:7b",
  "created_at": "2023-08-04T19:22:45.499127Z",
-  "response": "",
  "context": [1, 2, 3],
  "done": true,
  "total_duration": 5589157167,
@@ -114,7 +112,6 @@ Create a model from a [`Modelfile`](./modelfile.md)

 - `name`: name of the model to create
 - `path`: path to the Modelfile
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects

 ### Request

@@ -182,7 +179,7 @@ Show details about a model including modelfile, template, parameters, license, a

 ### Request

-```shell
+```shell  
 curl http://localhost:11434/api/show -d '{
  "name": "llama2:7b"
 }'
@@ -192,10 +189,10 @@ curl http://localhost:11434/api/show -d '{

 ```json
 {
-  "license": "<contents of license block>",
-  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llama2:latest\n\nFROM /Users/username/.ollama/models/blobs/sha256:8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8\nTEMPLATE \"\"\"[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] \"\"\"\nSYSTEM \"\"\"\"\"\"\nPARAMETER stop [INST]\nPARAMETER stop [/INST]\nPARAMETER stop <<SYS>>\nPARAMETER stop <</SYS>>\n",
-  "parameters": "stop                           [INST]\nstop                           [/INST]\nstop                           <<SYS>>\nstop                           <</SYS>>",
-  "template": "[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] "
+    "license": "<contents of license block>",
+    "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llama2:latest\n\nFROM /Users/username/.ollama/models/blobs/sha256:8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8\nTEMPLATE \"\"\"[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] \"\"\"\nSYSTEM \"\"\"\"\"\"\nPARAMETER stop [INST]\nPARAMETER stop [/INST]\nPARAMETER stop <<SYS>>\nPARAMETER stop <</SYS>>\n",
+    "parameters": "stop                           [INST]\nstop                           [/INST]\nstop                           <<SYS>>\nstop                           <</SYS>>",
+    "template": "[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] "
 }
 ```

@@ -248,7 +245,6 @@ Download a model from the ollama library. Cancelled pulls are resumed from where

 - `name`: name of the model to pull
 - `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects

 ### Request

@@ -279,8 +275,7 @@ Upload a model to a model library. Requires registering for ollama.ai and adding
 ### Parameters

 - `name`: name of the model to push in the form of `<namespace>/<model>:<tag>`
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects
+- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.  

 ### Request

@@ -295,16 +290,15 @@ curl -X POST http://localhost:11434/api/push -d '{
 Streaming response that starts with:

 ```json
-{ "status": "retrieving manifest" }
+{"status":"retrieving manifest"}
 ```

 and then:

 ```json
 {
-  "status": "starting upload",
-  "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab",
-  "total": 1928429856
+  "status": "starting upload", "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab",
+  "total": 1928429856
 }
 ```

@@ -312,10 +306,9 @@ Then there is a series of uploading responses:

 ```json
 {
-  "status": "starting upload",
-  "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab",
-  "total": 1928429856
-}
+"status":"starting upload",
+"digest":"sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab",
+"total":1928429856}
 ```

 Finally, when the upload is complete:
@@ -355,9 +348,8 @@ curl -X POST http://localhost:11434/api/embeddings -d '{

 ```json
 {
-  "embedding": [
+  "embeddings": [
    0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313,
    0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
  ]
-}
-```
+}```
--- a/docs/development.md
+++ b/docs/development.md
@@ -10,25 +10,25 @@ Install required tools:
 - go version 1.20 or higher
 - gcc version 11.4.0 or higher

-```bash
+```
 brew install go cmake gcc
 ```

 Get the required libraries:

-```bash
+```
 go generate ./...
 ```

 Then build ollama:

-```bash
+```
 go build .
 ```

 Now you can run `ollama`:

-```bash
+```
 ./ollama
 ```

--- a/docs/faq.md
+++ b/docs/faq.md
@@ -1,34 +1,19 @@
 # FAQ

-## How can I view the logs?
-
-On macOS:
-
-```
-cat ~/.ollama/logs/server.log
-```
-
-On Linux:
-
-```
-journalctl -u ollama
-```
-
-If you're running `ollama serve` directly, the logs will be printed to the console.
-
 ## How can I expose the Ollama server?

-```bash
+```
 OLLAMA_HOST=0.0.0.0:11435 ollama serve
 ```

 By default, Ollama allows cross origin requests from `127.0.0.1` and `0.0.0.0`. To support more origins, you can use the `OLLAMA_ORIGINS` environment variable:

-```bash
+```
 OLLAMA_ORIGINS=http://192.168.1.1:*,https://example.com ollama serve
 ```

 ## Where are models stored?

- macOS: Raw model data is stored under `~/.ollama/models`.
- Linux: Raw model data is stored under `/usr/share/ollama/.ollama/models`
+* macOS: Raw model data is stored under `~/.ollama/models`.
+* Linux: Raw model data is stored under `/usr/share/ollama/.ollama/models`
+
--- a/docs/import.md
+++ b/docs/import.md
@@ -1,163 +0,0 @@
-# Import a model
-
-This guide walks through importing a PyTorch, Safetensors or GGUF model.
-
-## Supported models
-
-Ollama supports a set of model architectures, with support for more coming soon:
-
- Llama & Mistral
- Falcon & RW
- GPT-NeoX
- BigCode
-
-To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).
-
-## Importing
-
-### Step 1: Clone the HuggingFace repository (optional)
-
-If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
-
-```
-git lfs install
-git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
-cd Mistral-7B-Instruct-v0.1
-```
-
-### Step 2: Convert and quantize to a `.bin` file (optional, for PyTorch and Safetensors)
-
-If the model is in PyTorch or Safetensors format, a [Docker image](https://hub.docker.com/r/ollama/quantize) with the tooling required to convert and quantize models is available.
-
-First, Install [Docker](https://www.docker.com/get-started/).
-
-Next, to convert and quantize your model, run:
-
-```
-docker run --rm -v .:/model ollama/quantize -q q4_0 /model
-```
-
-This will output two files into the directory:
-
- `f16.bin`: the model converted to GGUF
- `q4_0.bin` the model quantized to a 4-bit quantization (we will use this file to create the Ollama model)
-
-### Step 3: Write a `Modelfile`
-
-Next, create a `Modelfile` for your model. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
-
-```
-FROM ./q4_0.bin
-```
-
-(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
-
-```
-FROM ./q4_0.bin
-TEMPLATE "[INST] {{ .Prompt }} [/INST]"
-```
-
-### Step 4: Create the Ollama model
-
-Finally, create a model from your `Modelfile`:
-
-```
-ollama create example -f Modelfile
-```
-
-Next, test the model with `ollama run`:
-
-```
-ollama run example "What is your favourite condiment?"
-```
-
-### Step 5: Publish your model (optional – early alpha)
-
-Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:
-
-1. Create [an account](https://ollama.ai/signup)
-2. Run `cat ~/.ollama/id_ed25519.pub` to view your Ollama public key. Copy this to the clipboard.
-3. Add your public key to your [Ollama account](https://ollama.ai/settings/keys)
-
-Next, copy your model to your username's namespace:
-
-```
-ollama cp example <your username>/example
-```
-
-Then push the model:
-
-```
-ollama push <your username>/example
-```
-
-After publishing, your model will be available at `https://ollama.ai/<your username>/example`.
-
-## Quantization reference
-
-The quantization options are as follow (from highest highest to lowest levels of quantization). Note: some architectures such as Falcon do not support K quants.
-
- `q2_K`
- `q3_K`
- `q3_K_S`
- `q3_K_M`
- `q3_K_L`
- `q4_0` (recommended)
- `q4_1`
- `q4_K`
- `q4_K_S`
- `q4_K_M`
- `q5_0`
- `q5_1`
- `q5_K`
- `q5_K_S`
- `q5_K_M`
- `q6_K`
- `q8_0`
-
-## Manually converting & quantizing models
-
-### Prerequisites
-
-Start by cloning the `llama.cpp` repo to your machine in another directory:
-
-```
-git clone https://github.com/ggerganov/llama.cpp.git
-cd llama.cpp
-```
-
-Next, install the Python dependencies:
-
-```
-pip install -r requirements.txt
-```
-
-Finally, build the `quantize` tool:
-
-```
-make quantize
-```
-
-### Convert the model
-
-Run the correct conversion script for your model architecture:
-
-```shell
-# LlamaForCausalLM or MistralForCausalLM
-python convert.py <path to model directory>
-
-# FalconForCausalLM
-python convert-falcon-hf-to-gguf.py <path to model directory>
-
-# GPTNeoXForCausalLM
-python convert-falcon-hf-to-gguf.py <path to model directory>
-
-# GPTBigCodeForCausalLM
-python convert-starcoder-hf-to-gguf.py <path to model directory>
-```
-
-### Quantize the model
-
-```
-quantize <path to model dir>/ggml-model-f32.bin <path to model dir>/q4_0.bin q4_0
-```
--- a/docs/linux.md
+++ b/docs/linux.md
@@ -2,7 +2,7 @@

 > Note: A one line installer for Ollama is available by running:
 >
-> ```bash
+> ```
 > curl https://ollama.ai/install.sh | sh
 > ```

@@ -10,7 +10,7 @@

 Ollama is distributed as a self-contained binary. Download it to a directory in your PATH:

-```bash
+```
 sudo curl -L https://ollama.ai/download/ollama-linux-amd64 -o /usr/bin/ollama
 sudo chmod +x /usr/bin/ollama
 ```
@@ -19,13 +19,13 @@ sudo chmod +x /usr/bin/ollama

 Start Ollama by running `ollama serve`:

-```bash
+```
 ollama serve
 ```

 Once Ollama is running, run a model in another terminal session:

-```bash
+```
 ollama run llama2
 ```

@@ -35,7 +35,7 @@ ollama run llama2

 Verify that the drivers are installed by running the following command, which should print details about your GPU:

-```bash
+```
 nvidia-smi
 ```

@@ -43,7 +43,7 @@ nvidia-smi

 Create a user for Ollama:

-```bash
+```
 sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
 ```

@@ -68,7 +68,7 @@ WantedBy=default.target

 Then start the service:

-```bash
+```
 sudo systemctl daemon-reload
 sudo systemctl enable ollama
 ```
@@ -77,6 +77,7 @@ sudo systemctl enable ollama

 To view logs of Ollama running as a startup service, run:

-```bash
+```
 journalctl -u ollama
 ```
+
--- a/docs/modelfile.md
+++ b/docs/modelfile.md
@@ -1,6 +1,6 @@
 # Ollama Model File

-> Note: this `Modelfile` syntax is in development
+> Note: this model file syntax is in development

 A model file is the blueprint to create and share models with Ollama.

@@ -12,6 +12,7 @@ A model file is the blueprint to create and share models with Ollama.
  - [FROM (Required)](#from-required)
    - [Build from llama2](#build-from-llama2)
    - [Build from a bin file](#build-from-a-bin-file)
+  - [EMBED](#embed)
  - [PARAMETER](#parameter)
    - [Valid Parameters and Values](#valid-parameters-and-values)
  - [TEMPLATE](#template)
@@ -23,7 +24,7 @@ A model file is the blueprint to create and share models with Ollama.

 ## Format

-The format of the `Modelfile`:
+The format of the Modelfile:

 ```modelfile
 # comment
@@ -41,9 +42,9 @@ INSTRUCTION arguments

 ## Examples

-An example of a `Modelfile` creating a mario blueprint:
+An example of a model file creating a mario blueprint:

-```modelfile
+```
 FROM llama2
 # sets the temperature to 1 [higher is more creative, lower is more coherent]
 PARAMETER temperature 1
@@ -56,9 +57,9 @@ SYSTEM You are Mario from super mario bros, acting as an assistant.

 To use this:

-1. Save it as a file (e.g. `Modelfile`)
-2. `ollama create choose-a-model-name -f <location of the file e.g. ./Modelfile>'`
-3. `ollama run choose-a-model-name`
+1. Save it as a file (e.g. `Modelfile`)
+2. `ollama create NAME -f <location of the file e.g. ./Modelfile>`
+3. `ollama run NAME`
 4. Start using the model!

 More examples are available in the [examples directory](../examples).
@@ -67,34 +68,45 @@ More examples are available in the [examples directory](../examples).

 ### FROM (Required)

-The `FROM` instruction defines the base model to use when creating a model.
+The FROM instruction defines the base model to use when creating a model.

-```modelfile
+```
 FROM <model name>:<tag>
 ```

 #### Build from llama2

-```modelfile
+```
 FROM llama2
 ```

 A list of available base models:
 <https://github.com/jmorganca/ollama#model-library>

-#### Build from a `bin` file
+#### Build from a bin file

-```modelfile
+```
 FROM ./ollama-model.bin
 ```

-This bin file location should be specified as an absolute path or relative to the `Modelfile` location.
+This bin file location should be specified as an absolute path or relative to the Modelfile location.
+
+### EMBED
+
+The EMBED instruction is used to add embeddings of files to a model. This is useful for adding custom data that the model can reference when generating an answer. Note that currently only text files are supported, formatted with each line as one embedding.
+
+```
+FROM <model name>:<tag>
+EMBED <file path>.txt
+EMBED <different file path>.txt
+EMBED <path to directory>/*.txt
+```

 ### PARAMETER

 The `PARAMETER` instruction defines a parameter that can be set when the model is run.

-```modelfile
+```
 PARAMETER <parameter> <parametervalue>
 ```

@@ -112,7 +124,6 @@ PARAMETER <parameter> <parametervalue>
 | repeat_last_n  | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)                                                                                                                                           | int        | repeat_last_n 64     |
 | repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)                                                                     | float      | repeat_penalty 1.1   |
 | temperature    | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)                                                                                                                                     | float      | temperature 0.7      |
-| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
 | stop           | Sets the stop sequences to use.                                                                                                                                                                                                                         | string     | stop "AI assistant:" |
 | tfs_z          | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1)                                               | float      | tfs_z 1              |
 | num_predict    | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)                                                                                                                                   | int        | num_predict 42       |
@@ -121,7 +132,7 @@ PARAMETER <parameter> <parametervalue>

 ### TEMPLATE

-`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the readme for that model.
+`TEMPLATE` defines the full prompt template to be passed into the model. It may optionally include a system prompt and a user's prompt. This is used to create a full custom prompt, and the syntax may be model specific.

 #### Template Variables

@@ -131,7 +142,7 @@ PARAMETER <parameter> <parametervalue>
 | `{{ .Prompt }}` | The incoming prompt, this is not specified in the model file and will be set based on input.                 |
 | `{{ .First }}`  | A boolean value used to render specific template information for the first generation of a session.          |

-```modelfile
+```
 TEMPLATE """
 {{- if .First }}
 ### System:
@@ -151,7 +162,7 @@ SYSTEM """<system message>"""

 The `SYSTEM` instruction specifies the system prompt to be used in the template, if applicable.

-```modelfile
+```
 SYSTEM """<system message>"""
 ```

@@ -159,7 +170,7 @@ SYSTEM """<system message>"""

 The `ADAPTER` instruction specifies the LoRA adapter to apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile and the file must be in a GGML file format. The adapter should be tuned from the base model otherwise the behaviour is undefined.

-```modelfile
+```
 ADAPTER ./ollama-lora.bin
 ```

@@ -167,7 +178,7 @@ ADAPTER ./ollama-lora.bin

 The `LICENSE` instruction allows you to specify the legal license under which the model used with this Modelfile is shared or distributed.

-```modelfile
+```
 LICENSE """
 <license text>
 """
@@ -175,5 +186,5 @@ LICENSE """

 ## Notes

- the **`Modelfile` is not case sensitive**. In the examples, we use uppercase for instructions to make it easier to distinguish it from arguments.
+- The **modelfile is not case sensitive**. In the examples, we use uppercase for instructions to make them easier to distinguish from arguments.
 - Instructions can be in any order. In the examples, we start with FROM instruction to keep it easily readable.
--- a/examples/.gitignore
+++ b/examples/.gitignore
@@ -1,171 +0,0 @@
-node_modules
-# OSX
-.DS_STORE
-
-# Models
-models/
-
-# Local Chroma db
-.chroma/
-db/
-
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-share/python-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-#  Usually these files are written by a python script from a template
-#  before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.nox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
-.pytest_cache/
-cover/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-.pybuilder/
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-#   For a library or package, you might want to ignore these files since the code is
-#   intended to run in multiple environments; otherwise, check them in:
-# .python-version
-
-# pipenv
-#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-#   However, in case of collaboration, if having platform-specific dependencies or dependencies
-#   having no cross-platform support, pipenv may install dependencies that don't work, or not
-#   install all needed dependencies.
-#Pipfile.lock
-
-# poetry
-#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
-#   This is especially recommended for binary packages to ensure reproducibility, and is more
-#   commonly ignored for libraries.
-#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
-#poetry.lock
-
-# pdm
-#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
-#pdm.lock
-#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
-#   in version control.
-#   https://pdm.fming.dev/#use-with-ide
-.pdm.toml
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
-
-# pytype static type analyzer
-.pytype/
-
-# Cython debug symbols
-cython_debug/
-
-# PyCharm
-#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-#  and can be added to the global gitignore or merged into this file.  For a more nuclear
-#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
--- a/examples/modelfile-10tweets/Modelfile
+++ b/examples/modelfile-10tweets/Modelfile
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,3 +1,15 @@
 # Examples

-This directory contains different examples of using Ollama.
+This directory contains different examples of using Ollama
+
+To create a model:
+
+```
+ollama create example -f <example file>
+```
+
+To run a model:
+
+```
+ollama run example
+```
--- a/examples/modelfile-devopsengineer/Modelfile
+++ b/examples/modelfile-devopsengineer/Modelfile
@@ -1,7 +1,7 @@
 # Modelfile for creating a devops engineer assistant
 # Run `ollama create devops-engineer -f ./Modelfile` and then `ollama run devops-engineer` and enter a topic

-FROM mistral
+FROM llama2:13b
 PARAMETER temperature 1
 SYSTEM """
 You are a senior devops engineer, acting as an assistant. You offer help with cloud technologies like: Terraform, AWS, kubernetes, python. You answer with code examples when possible
--- a/examples/python-dockerit/Modelfile
+++ b/examples/python-dockerit/Modelfile
@@ -1,4 +1,4 @@
-FROM mistral
+FROM llama2
 SYSTEM """
 You are an experienced Devops engineer focused on docker. When given specifications for a particular need or application you know the best way to host that within a docker container. For instance if someone tells you they want an nginx server to host files located at /web you will answer as follows

--- a/examples/python-dockerit/README.md
+++ b/examples/python-dockerit/README.md
--- a/examples/python-dockerit/dockerit.py
+++ b/examples/python-dockerit/dockerit.py
--- a/examples/python-dockerit/requirements.txt
+++ b/examples/python-dockerit/requirements.txt
--- a/examples/golang-simplegenerate/README.md
+++ b/examples/golang-simplegenerate/README.md
--- a/examples/golang-simplegenerate/main.go
+++ b/examples/golang-simplegenerate/main.go
@@ -1,27 +0,0 @@
-package main
-
-import (
-	"bytes"
-	"fmt"
-	"io"
-	"log"
-	"net/http"
-	"os"
-)
-
-func main() {
-	body := []byte(`{"model":"mistral"}`)
-	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewBuffer(body))
-
-	if err != nil {
-		fmt.Print(err.Error())
-		os.Exit(1)
-	}
-
-	responseData, err := io.ReadAll(resp.Body)
-	if err != nil {
-		log.Fatal(err)
-	}
-	fmt.Println(string(responseData))
-
-}
--- a/examples/langchain-python-rag-document/README.md
+++ b/examples/langchain-python-rag-document/README.md
--- a/examples/langchain-python-rag-document/main.py
+++ b/examples/langchain-python-rag-document/main.py
--- a/examples/langchain-python-rag-document/requirements.txt
+++ b/examples/langchain-python-rag-document/requirements.txt
--- a/examples/langchain-typescript-simple/README.md
+++ b/examples/langchain-typescript-simple/README.md
@@ -1,21 +0,0 @@
-# LangChain
-
-This example is a basic "hello world" of using LangChain with Ollama using Node.js and Typescript.
-
-## Setup
-
-```shell
-npm install
-```
-
-## Run
-
-```shell
-ts-node main.ts
-```
-
-Running this example will print the response for "hello":
-
-```plaintext
-Hello! It's nice to meet you. hopefully you are having a great day! Is there something I can help you with or would you like to chat?
-```
--- a/examples/langchain-typescript-simple/main.ts
+++ b/examples/langchain-typescript-simple/main.ts
@@ -1,15 +0,0 @@
-import { Ollama} from 'langchain/llms/ollama';
-
-async function main() {
-  const ollama = new Ollama({
-    model: 'mistral'    
-    // other parameters can be found at https://js.langchain.com/docs/api/llms_ollama/classes/Ollama
-  })
-  const stream = await ollama.stream("Hello");
-
-  for await (const chunk of stream) {
-    process.stdout.write(chunk);
-  }
-}
-
-main();
--- a/examples/langchain-typescript-simple/package-lock.json
+++ b/examples/langchain-typescript-simple/package-lock.json
@@ -1,997 +0,0 @@
-{
-  "name": "with-langchain-typescript-simplegenerate",
-  "lockfileVersion": 3,
-  "requires": true,
-  "packages": {
-    "": {
-      "dependencies": {
-        "langchain": "^0.0.165"
-      },
-      "devDependencies": {
-        "typescript": "^5.2.2"
-      }
-    },
-    "node_modules/@anthropic-ai/sdk": {
-      "version": "0.6.2",
-      "resolved": "https://registry.npmjs.org/@anthropic-ai/sdk/-/sdk-0.6.2.tgz",
-      "integrity": "sha512-fB9PUj9RFT+XjkL+E9Ol864ZIJi+1P8WnbHspN3N3/GK2uSzjd0cbVIKTGgf4v3N8MwaQu+UWnU7C4BG/fap/g==",
-      "dependencies": {
-        "@types/node": "^18.11.18",
-        "@types/node-fetch": "^2.6.4",
-        "abort-controller": "^3.0.0",
-        "agentkeepalive": "^4.2.1",
-        "digest-fetch": "^1.3.0",
-        "form-data-encoder": "1.7.2",
-        "formdata-node": "^4.3.2",
-        "node-fetch": "^2.6.7"
-      }
-    },
-    "node_modules/@types/node": {
-      "version": "18.18.4",
-      "resolved": "https://registry.npmjs.org/@types/node/-/node-18.18.4.tgz",
-      "integrity": "sha512-t3rNFBgJRugIhackit2mVcLfF6IRc0JE4oeizPQL8Zrm8n2WY/0wOdpOPhdtG0V9Q2TlW/axbF1MJ6z+Yj/kKQ=="
-    },
-    "node_modules/@types/node-fetch": {
-      "version": "2.6.6",
-      "resolved": "https://registry.npmjs.org/@types/node-fetch/-/node-fetch-2.6.6.tgz",
-      "integrity": "sha512-95X8guJYhfqiuVVhRFxVQcf4hW/2bCuoPwDasMf/531STFoNoWTT7YDnWdXHEZKqAGUigmpG31r2FE70LwnzJw==",
-      "dependencies": {
-        "@types/node": "*",
-        "form-data": "^4.0.0"
-      }
-    },
-    "node_modules/@types/retry": {
-      "version": "0.12.0",
-      "resolved": "https://registry.npmjs.org/@types/retry/-/retry-0.12.0.tgz",
-      "integrity": "sha512-wWKOClTTiizcZhXnPY4wikVAwmdYHp8q6DmC+EJUzAMsycb7HB32Kh9RN4+0gExjmPmZSAQjgURXIGATPegAvA=="
-    },
-    "node_modules/@types/uuid": {
-      "version": "9.0.5",
-      "resolved": "https://registry.npmjs.org/@types/uuid/-/uuid-9.0.5.tgz",
-      "integrity": "sha512-xfHdwa1FMJ082prjSJpoEI57GZITiQz10r3vEJCHa2khEFQjKy91aWKz6+zybzssCvXUwE1LQWgWVwZ4nYUvHQ=="
-    },
-    "node_modules/abort-controller": {
-      "version": "3.0.0",
-      "resolved": "https://registry.npmjs.org/abort-controller/-/abort-controller-3.0.0.tgz",
-      "integrity": "sha512-h8lQ8tacZYnR3vNQTgibj+tODHI5/+l06Au2Pcriv/Gmet0eaj4TwWH41sO9wnHDiQsEj19q0drzdWdeAHtweg==",
-      "dependencies": {
-        "event-target-shim": "^5.0.0"
-      },
-      "engines": {
-        "node": ">=6.5"
-      }
-    },
-    "node_modules/agentkeepalive": {
-      "version": "4.5.0",
-      "resolved": "https://registry.npmjs.org/agentkeepalive/-/agentkeepalive-4.5.0.tgz",
-      "integrity": "sha512-5GG/5IbQQpC9FpkRGsSvZI5QYeSCzlJHdpBQntCsuTOxhKD8lqKhrleg2Yi7yvMIf82Ycmmqln9U8V9qwEiJew==",
-      "dependencies": {
-        "humanize-ms": "^1.2.1"
-      },
-      "engines": {
-        "node": ">= 8.0.0"
-      }
-    },
-    "node_modules/ansi-styles": {
-      "version": "5.2.0",
-      "resolved": "https://registry.npmjs.org/ansi-styles/-/ansi-styles-5.2.0.tgz",
-      "integrity": "sha512-Cxwpt2SfTzTtXcfOlzGEee8O+c+MmUgGrNiBcXnuWxuFJHe6a5Hz7qwhwe5OgaSYI0IJvkLqWX1ASG+cJOkEiA==",
-      "engines": {
-        "node": ">=10"
-      },
-      "funding": {
-        "url": "https://github.com/chalk/ansi-styles?sponsor=1"
-      }
-    },
-    "node_modules/argparse": {
-      "version": "2.0.1",
-      "resolved": "https://registry.npmjs.org/argparse/-/argparse-2.0.1.tgz",
-      "integrity": "sha512-8+9WqebbFzpX9OR+Wa6O29asIogeRMzcGtAINdpMHHyAg10f05aSFVBbcEqGf/PXw1EjAZ+q2/bEBg3DvurK3Q=="
-    },
-    "node_modules/asynckit": {
-      "version": "0.4.0",
-      "resolved": "https://registry.npmjs.org/asynckit/-/asynckit-0.4.0.tgz",
-      "integrity": "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q=="
-    },
-    "node_modules/base-64": {
-      "version": "0.1.0",
-      "resolved": "https://registry.npmjs.org/base-64/-/base-64-0.1.0.tgz",
-      "integrity": "sha512-Y5gU45svrR5tI2Vt/X9GPd3L0HNIKzGu202EjxrXMpuc2V2CiKgemAbUUsqYmZJvPtCXoUKjNZwBJzsNScUbXA=="
-    },
-    "node_modules/base64-js": {
-      "version": "1.5.1",
-      "resolved": "https://registry.npmjs.org/base64-js/-/base64-js-1.5.1.tgz",
-      "integrity": "sha512-AKpaYlHn8t4SVbOHCy+b5+KKgvR4vrsD8vbvrbiQJps7fKDTkjkDry6ji0rUJjC0kzbNePLwzxq8iypo41qeWA==",
-      "funding": [
-        {
-          "type": "github",
-          "url": "https://github.com/sponsors/feross"
-        },
-        {
-          "type": "patreon",
-          "url": "https://www.patreon.com/feross"
-        },
-        {
-          "type": "consulting",
-          "url": "https://feross.org/support"
-        }
-      ]
-    },
-    "node_modules/binary-extensions": {
-      "version": "2.2.0",
-      "resolved": "https://registry.npmjs.org/binary-extensions/-/binary-extensions-2.2.0.tgz",
-      "integrity": "sha512-jDctJ/IVQbZoJykoeHbhXpOlNBqGNcwXJKJog42E5HDPUwQTSdjCHdihjj0DlnheQ7blbT6dHOafNAiS8ooQKA==",
-      "engines": {
-        "node": ">=8"
-      }
-    },
-    "node_modules/binary-search": {
-      "version": "1.3.6",
-      "resolved": "https://registry.npmjs.org/binary-search/-/binary-search-1.3.6.tgz",
-      "integrity": "sha512-nbE1WxOTTrUWIfsfZ4aHGYu5DOuNkbxGokjV6Z2kxfJK3uaAb8zNK1muzOeipoLHZjInT4Br88BHpzevc681xA=="
-    },
-    "node_modules/camelcase": {
-      "version": "6.3.0",
-      "resolved": "https://registry.npmjs.org/camelcase/-/camelcase-6.3.0.tgz",
-      "integrity": "sha512-Gmy6FhYlCY7uOElZUSbxo2UCDH8owEk996gkbrpsgGtrJLM3J7jGxl9Ic7Qwwj4ivOE5AWZWRMecDdF7hqGjFA==",
-      "engines": {
-        "node": ">=10"
-      },
-      "funding": {
-        "url": "https://github.com/sponsors/sindresorhus"
-      }
-    },
-    "node_modules/charenc": {
-      "version": "0.0.2",
-      "resolved": "https://registry.npmjs.org/charenc/-/charenc-0.0.2.tgz",
-      "integrity": "sha512-yrLQ/yVUFXkzg7EDQsPieE/53+0RlaWTs+wBrvW36cyilJ2SaDWfl4Yj7MtLTXleV9uEKefbAGUPv2/iWSooRA==",
-      "engines": {
-        "node": "*"
-      }
-    },
-    "node_modules/combined-stream": {
-      "version": "1.0.8",
-      "resolved": "https://registry.npmjs.org/combined-stream/-/combined-stream-1.0.8.tgz",
-      "integrity": "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg==",
-      "dependencies": {
-        "delayed-stream": "~1.0.0"
-      },
-      "engines": {
-        "node": ">= 0.8"
-      }
-    },
-    "node_modules/commander": {
-      "version": "10.0.1",
-      "resolved": "https://registry.npmjs.org/commander/-/commander-10.0.1.tgz",
-      "integrity": "sha512-y4Mg2tXshplEbSGzx7amzPwKKOCGuoSRP/CjEdwwk0FOGlUbq6lKuoyDZTNZkmxHdJtp54hdfY/JUrdL7Xfdug==",
-      "engines": {
-        "node": ">=14"
-      }
-    },
-    "node_modules/crypt": {
-      "version": "0.0.2",
-      "resolved": "https://registry.npmjs.org/crypt/-/crypt-0.0.2.tgz",
-      "integrity": "sha512-mCxBlsHFYh9C+HVpiEacem8FEBnMXgU9gy4zmNC+SXAZNB/1idgp/aulFJ4FgCi7GPEVbfyng092GqL2k2rmow==",
-      "engines": {
-        "node": "*"
-      }
-    },
-    "node_modules/decamelize": {
-      "version": "1.2.0",
-      "resolved": "https://registry.npmjs.org/decamelize/-/decamelize-1.2.0.tgz",
-      "integrity": "sha512-z2S+W9X73hAUUki+N+9Za2lBlun89zigOyGrsax+KUQ6wKW4ZoWpEYBkGhQjwAjjDCkWxhY0VKEhk8wzY7F5cA==",
-      "engines": {
-        "node": ">=0.10.0"
-      }
-    },
-    "node_modules/delayed-stream": {
-      "version": "1.0.0",
-      "resolved": "https://registry.npmjs.org/delayed-stream/-/delayed-stream-1.0.0.tgz",
-      "integrity": "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ==",
-      "engines": {
-        "node": ">=0.4.0"
-      }
-    },
-    "node_modules/digest-fetch": {
-      "version": "1.3.0",
-      "resolved": "https://registry.npmjs.org/digest-fetch/-/digest-fetch-1.3.0.tgz",
-      "integrity": "sha512-CGJuv6iKNM7QyZlM2T3sPAdZWd/p9zQiRNS9G+9COUCwzWFTs0Xp8NF5iePx7wtvhDykReiRRrSeNb4oMmB8lA==",
-      "dependencies": {
-        "base-64": "^0.1.0",
-        "md5": "^2.3.0"
-      }
-    },
-    "node_modules/event-target-shim": {
-      "version": "5.0.1",
-      "resolved": "https://registry.npmjs.org/event-target-shim/-/event-target-shim-5.0.1.tgz",
-      "integrity": "sha512-i/2XbnSz/uxRCU6+NdVJgKWDTM427+MqYbkQzD321DuCQJUqOuJKIA0IM2+W2xtYHdKOmZ4dR6fExsd4SXL+WQ==",
-      "engines": {
-        "node": ">=6"
-      }
-    },
-    "node_modules/eventemitter3": {
-      "version": "4.0.7",
-      "resolved": "https://registry.npmjs.org/eventemitter3/-/eventemitter3-4.0.7.tgz",
-      "integrity": "sha512-8guHBZCwKnFhYdHr2ysuRWErTwhoN2X8XELRlrRwpmfeY2jjuUN4taQMsULKUVo1K4DvZl+0pgfyoysHxvmvEw=="
-    },
-    "node_modules/expr-eval": {
-      "version": "2.0.2",
-      "resolved": "https://registry.npmjs.org/expr-eval/-/expr-eval-2.0.2.tgz",
-      "integrity": "sha512-4EMSHGOPSwAfBiibw3ndnP0AvjDWLsMvGOvWEZ2F96IGk0bIVdjQisOHxReSkE13mHcfbuCiXw+G4y0zv6N8Eg=="
-    },
-    "node_modules/flat": {
-      "version": "5.0.2",
-      "resolved": "https://registry.npmjs.org/flat/-/flat-5.0.2.tgz",
-      "integrity": "sha512-b6suED+5/3rTpUBdG1gupIl8MPFCAMA0QXwmljLhvCUKcUvdE4gWky9zpuGCcXHOsz4J9wPGNWq6OKpmIzz3hQ==",
-      "bin": {
-        "flat": "cli.js"
-      }
-    },
-    "node_modules/form-data": {
-      "version": "4.0.0",
-      "resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.0.tgz",
-      "integrity": "sha512-ETEklSGi5t0QMZuiXoA/Q6vcnxcLQP5vdugSpuAyi6SVGi2clPPp+xgEhuMaHC+zGgn31Kd235W35f7Hykkaww==",
-      "dependencies": {
-        "asynckit": "^0.4.0",
-        "combined-stream": "^1.0.8",
-        "mime-types": "^2.1.12"
-      },
-      "engines": {
-        "node": ">= 6"
-      }
-    },
-    "node_modules/form-data-encoder": {
-      "version": "1.7.2",
-      "resolved": "https://registry.npmjs.org/form-data-encoder/-/form-data-encoder-1.7.2.tgz",
-      "integrity": "sha512-qfqtYan3rxrnCk1VYaA4H+Ms9xdpPqvLZa6xmMgFvhO32x7/3J/ExcTd6qpxM0vH2GdMI+poehyBZvqfMTto8A=="
-    },
-    "node_modules/formdata-node": {
-      "version": "4.4.1",
-      "resolved": "https://registry.npmjs.org/formdata-node/-/formdata-node-4.4.1.tgz",
-      "integrity": "sha512-0iirZp3uVDjVGt9p49aTaqjk84TrglENEDuqfdlZQ1roC9CWlPk6Avf8EEnZNcAqPonwkG35x4n3ww/1THYAeQ==",
-      "dependencies": {
-        "node-domexception": "1.0.0",
-        "web-streams-polyfill": "4.0.0-beta.3"
-      },
-      "engines": {
-        "node": ">= 12.20"
-      }
-    },
-    "node_modules/humanize-ms": {
-      "version": "1.2.1",
-      "resolved": "https://registry.npmjs.org/humanize-ms/-/humanize-ms-1.2.1.tgz",
-      "integrity": "sha512-Fl70vYtsAFb/C06PTS9dZBo7ihau+Tu/DNCk/OyHhea07S+aeMWpFFkUaXRa8fI+ScZbEI8dfSxwY7gxZ9SAVQ==",
-      "dependencies": {
-        "ms": "^2.0.0"
-      }
-    },
-    "node_modules/is-any-array": {
-      "version": "2.0.1",
-      "resolved": "https://registry.npmjs.org/is-any-array/-/is-any-array-2.0.1.tgz",
-      "integrity": "sha512-UtilS7hLRu++wb/WBAw9bNuP1Eg04Ivn1vERJck8zJthEvXCBEBpGR/33u/xLKWEQf95803oalHrVDptcAvFdQ=="
-    },
-    "node_modules/is-buffer": {
-      "version": "1.1.6",
-      "resolved": "https://registry.npmjs.org/is-buffer/-/is-buffer-1.1.6.tgz",
-      "integrity": "sha512-NcdALwpXkTm5Zvvbk7owOUSvVvBKDgKP5/ewfXEznmQFfs4ZRmanOeKBTjRVjka3QFoN6XJ+9F3USqfHqTaU5w=="
-    },
-    "node_modules/js-tiktoken": {
-      "version": "1.0.7",
-      "resolved": "https://registry.npmjs.org/js-tiktoken/-/js-tiktoken-1.0.7.tgz",
-      "integrity": "sha512-biba8u/clw7iesNEWLOLwrNGoBP2lA+hTaBLs/D45pJdUPFXyxD6nhcDVtADChghv4GgyAiMKYMiRx7x6h7Biw==",
-      "dependencies": {
-        "base64-js": "^1.5.1"
-      }
-    },
-    "node_modules/js-yaml": {
-      "version": "4.1.0",
-      "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-4.1.0.tgz",
-      "integrity": "sha512-wpxZs9NoxZaJESJGIZTyDEaYpl0FKSA+FB9aJiyemKhMwkxQg63h4T1KJgUGHpTqPDNRcmmYLugrRjJlBtWvRA==",
-      "dependencies": {
-        "argparse": "^2.0.1"
-      },
-      "bin": {
-        "js-yaml": "bin/js-yaml.js"
-      }
-    },
-    "node_modules/jsonpointer": {
-      "version": "5.0.1",
-      "resolved": "https://registry.npmjs.org/jsonpointer/-/jsonpointer-5.0.1.tgz",
-      "integrity": "sha512-p/nXbhSEcu3pZRdkW1OfJhpsVtW1gd4Wa1fnQc9YLiTfAjn0312eMKimbdIQzuZl9aa9xUGaRlP9T/CJE/ditQ==",
-      "engines": {
-        "node": ">=0.10.0"
-      }
-    },
-    "node_modules/langchain": {
-      "version": "0.0.165",
-      "resolved": "https://registry.npmjs.org/langchain/-/langchain-0.0.165.tgz",
-      "integrity": "sha512-CpbNpjwaE+9lzjdw+pZz0VgnRrFivEgr7CVp9dDaAb5JpaJAA4V2v6uQ9ZPN+TSqupTQ79HFn2sfyZVEl2EG7Q==",
-      "dependencies": {
-        "@anthropic-ai/sdk": "^0.6.2",
-        "ansi-styles": "^5.0.0",
-        "binary-extensions": "^2.2.0",
-        "camelcase": "6",
-        "decamelize": "^1.2.0",
-        "expr-eval": "^2.0.2",
-        "flat": "^5.0.2",
-        "js-tiktoken": "^1.0.7",
-        "js-yaml": "^4.1.0",
-        "jsonpointer": "^5.0.1",
-        "langchainhub": "~0.0.6",
-        "langsmith": "~0.0.31",
-        "ml-distance": "^4.0.0",
-        "object-hash": "^3.0.0",
-        "openai": "~4.4.0",
-        "openapi-types": "^12.1.3",
-        "p-queue": "^6.6.2",
-        "p-retry": "4",
-        "uuid": "^9.0.0",
-        "yaml": "^2.2.1",
-        "zod": "^3.22.3",
-        "zod-to-json-schema": "^3.20.4"
-      },
-      "engines": {
-        "node": ">=18"
-      },
-      "peerDependencies": {
-        "@aws-crypto/sha256-js": "^5.0.0",
-        "@aws-sdk/client-bedrock-runtime": "^3.422.0",
-        "@aws-sdk/client-dynamodb": "^3.310.0",
-        "@aws-sdk/client-kendra": "^3.352.0",
-        "@aws-sdk/client-lambda": "^3.310.0",
-        "@aws-sdk/client-s3": "^3.310.0",
-        "@aws-sdk/client-sagemaker-runtime": "^3.310.0",
-        "@aws-sdk/client-sfn": "^3.310.0",
-        "@aws-sdk/credential-provider-node": "^3.388.0",
-        "@azure/storage-blob": "^12.15.0",
-        "@clickhouse/client": "^0.0.14",
-        "@cloudflare/ai": "^1.0.12",
-        "@elastic/elasticsearch": "^8.4.0",
-        "@getmetal/metal-sdk": "*",
-        "@getzep/zep-js": "^0.7.0",
-        "@gomomento/sdk": "^1.23.0",
-        "@google-ai/generativelanguage": "^0.2.1",
-        "@google-cloud/storage": "^6.10.1",
-        "@huggingface/inference": "^1.5.1",
-        "@mozilla/readability": "*",
-        "@notionhq/client": "^2.2.10",
-        "@opensearch-project/opensearch": "*",
-        "@pinecone-database/pinecone": "^1.1.0",
-        "@planetscale/database": "^1.8.0",
-        "@qdrant/js-client-rest": "^1.2.0",
-        "@raycast/api": "^1.55.2",
-        "@smithy/eventstream-codec": "^2.0.5",
-        "@smithy/protocol-http": "^3.0.6",
-        "@smithy/signature-v4": "^2.0.10",
-        "@smithy/util-utf8": "^2.0.0",
-        "@supabase/postgrest-js": "^1.1.1",
-        "@supabase/supabase-js": "^2.10.0",
-        "@tensorflow-models/universal-sentence-encoder": "*",
-        "@tensorflow/tfjs-converter": "*",
-        "@tensorflow/tfjs-core": "*",
-        "@upstash/redis": "^1.20.6",
-        "@vercel/postgres": "^0.5.0",
-        "@writerai/writer-sdk": "^0.40.2",
-        "@xata.io/client": "^0.25.1",
-        "@xenova/transformers": "^2.5.4",
-        "@zilliz/milvus2-sdk-node": ">=2.2.7",
-        "apify-client": "^2.7.1",
-        "axios": "*",
-        "cassandra-driver": "^4.6.4",
-        "cheerio": "^1.0.0-rc.12",
-        "chromadb": "*",
-        "cohere-ai": ">=6.0.0",
-        "d3-dsv": "^2.0.0",
-        "epub2": "^3.0.1",
-        "faiss-node": "^0.3.0",
-        "fast-xml-parser": "^4.2.7",
-        "firebase-admin": "^11.9.0",
-        "google-auth-library": "^8.9.0",
-        "googleapis": "^126.0.1",
-        "hnswlib-node": "^1.4.2",
-        "html-to-text": "^9.0.5",
-        "ignore": "^5.2.0",
-        "ioredis": "^5.3.2",
-        "jsdom": "*",
-        "llmonitor": "*",
-        "lodash": "^4.17.21",
-        "mammoth": "*",
-        "mongodb": "^5.2.0",
-        "mysql2": "^3.3.3",
-        "neo4j-driver": "*",
-        "node-llama-cpp": "*",
-        "notion-to-md": "^3.1.0",
-        "pdf-parse": "1.1.1",
-        "peggy": "^3.0.2",
-        "pg": "^8.11.0",
-        "pg-copy-streams": "^6.0.5",
-        "pickleparser": "^0.1.0",
-        "playwright": "^1.32.1",
-        "portkey-ai": "^0.1.11",
-        "puppeteer": "^19.7.2",
-        "redis": "^4.6.4",
-        "replicate": "^0.18.0",
-        "sonix-speech-recognition": "^2.1.1",
-        "srt-parser-2": "^1.2.2",
-        "typeorm": "^0.3.12",
-        "typesense": "^1.5.3",
-        "usearch": "^1.1.1",
-        "vectordb": "^0.1.4",
-        "voy-search": "0.6.2",
-        "weaviate-ts-client": "^1.4.0",
-        "web-auth-library": "^1.0.3",
-        "youtube-transcript": "^1.0.6",
-        "youtubei.js": "^5.8.0"
-      },
-      "peerDependenciesMeta": {
-        "@aws-crypto/sha256-js": {
-          "optional": true
-        },
-        "@aws-sdk/client-bedrock-runtime": {
-          "optional": true
-        },
-        "@aws-sdk/client-dynamodb": {
-          "optional": true
-        },
-        "@aws-sdk/client-kendra": {
-          "optional": true
-        },
-        "@aws-sdk/client-lambda": {
-          "optional": true
-        },
-        "@aws-sdk/client-s3": {
-          "optional": true
-        },
-        "@aws-sdk/client-sagemaker-runtime": {
-          "optional": true
-        },
-        "@aws-sdk/client-sfn": {
-          "optional": true
-        },
-        "@aws-sdk/credential-provider-node": {
-          "optional": true
-        },
-        "@azure/storage-blob": {
-          "optional": true
-        },
-        "@clickhouse/client": {
-          "optional": true
-        },
-        "@cloudflare/ai": {
-          "optional": true
-        },
-        "@elastic/elasticsearch": {
-          "optional": true
-        },
-        "@getmetal/metal-sdk": {
-          "optional": true
-        },
-        "@getzep/zep-js": {
-          "optional": true
-        },
-        "@gomomento/sdk": {
-          "optional": true
-        },
-        "@google-ai/generativelanguage": {
-          "optional": true
-        },
-        "@google-cloud/storage": {
-          "optional": true
-        },
-        "@huggingface/inference": {
-          "optional": true
-        },
-        "@mozilla/readability": {
-          "optional": true
-        },
-        "@notionhq/client": {
-          "optional": true
-        },
-        "@opensearch-project/opensearch": {
-          "optional": true
-        },
-        "@pinecone-database/pinecone": {
-          "optional": true
-        },
-        "@planetscale/database": {
-          "optional": true
-        },
-        "@qdrant/js-client-rest": {
-          "optional": true
-        },
-        "@raycast/api": {
-          "optional": true
-        },
-        "@smithy/eventstream-codec": {
-          "optional": true
-        },
-        "@smithy/protocol-http": {
-          "optional": true
-        },
-        "@smithy/signature-v4": {
-          "optional": true
-        },
-        "@smithy/util-utf8": {
-          "optional": true
-        },
-        "@supabase/postgrest-js": {
-          "optional": true
-        },
-        "@supabase/supabase-js": {
-          "optional": true
-        },
-        "@tensorflow-models/universal-sentence-encoder": {
-          "optional": true
-        },
-        "@tensorflow/tfjs-converter": {
-          "optional": true
-        },
-        "@tensorflow/tfjs-core": {
-          "optional": true
-        },
-        "@upstash/redis": {
-          "optional": true
-        },
-        "@vercel/postgres": {
-          "optional": true
-        },
-        "@writerai/writer-sdk": {
-          "optional": true
-        },
-        "@xata.io/client": {
-          "optional": true
-        },
-        "@xenova/transformers": {
-          "optional": true
-        },
-        "@zilliz/milvus2-sdk-node": {
-          "optional": true
-        },
-        "apify-client": {
-          "optional": true
-        },
-        "axios": {
-          "optional": true
-        },
-        "cassandra-driver": {
-          "optional": true
-        },
-        "cheerio": {
-          "optional": true
-        },
-        "chromadb": {
-          "optional": true
-        },
-        "cohere-ai": {
-          "optional": true
-        },
-        "d3-dsv": {
-          "optional": true
-        },
-        "epub2": {
-          "optional": true
-        },
-        "faiss-node": {
-          "optional": true
-        },
-        "fast-xml-parser": {
-          "optional": true
-        },
-        "firebase-admin": {
-          "optional": true
-        },
-        "google-auth-library": {
-          "optional": true
-        },
-        "googleapis": {
-          "optional": true
-        },
-        "hnswlib-node": {
-          "optional": true
-        },
-        "html-to-text": {
-          "optional": true
-        },
-        "ignore": {
-          "optional": true
-        },
-        "ioredis": {
-          "optional": true
-        },
-        "jsdom": {
-          "optional": true
-        },
-        "llmonitor": {
-          "optional": true
-        },
-        "lodash": {
-          "optional": true
-        },
-        "mammoth": {
-          "optional": true
-        },
-        "mongodb": {
-          "optional": true
-        },
-        "mysql2": {
-          "optional": true
-        },
-        "neo4j-driver": {
-          "optional": true
-        },
-        "node-llama-cpp": {
-          "optional": true
-        },
-        "notion-to-md": {
-          "optional": true
-        },
-        "pdf-parse": {
-          "optional": true
-        },
-        "peggy": {
-          "optional": true
-        },
-        "pg": {
-          "optional": true
-        },
-        "pg-copy-streams": {
-          "optional": true
-        },
-        "pickleparser": {
-          "optional": true
-        },
-        "playwright": {
-          "optional": true
-        },
-        "portkey-ai": {
-          "optional": true
-        },
-        "puppeteer": {
-          "optional": true
-        },
-        "redis": {
-          "optional": true
-        },
-        "replicate": {
-          "optional": true
-        },
-        "sonix-speech-recognition": {
-          "optional": true
-        },
-        "srt-parser-2": {
-          "optional": true
-        },
-        "typeorm": {
-          "optional": true
-        },
-        "typesense": {
-          "optional": true
-        },
-        "usearch": {
-          "optional": true
-        },
-        "vectordb": {
-          "optional": true
-        },
-        "voy-search": {
-          "optional": true
-        },
-        "weaviate-ts-client": {
-          "optional": true
-        },
-        "web-auth-library": {
-          "optional": true
-        },
-        "youtube-transcript": {
-          "optional": true
-        },
-        "youtubei.js": {
-          "optional": true
-        }
-      }
-    },
-    "node_modules/langchainhub": {
-      "version": "0.0.6",
-      "resolved": "https://registry.npmjs.org/langchainhub/-/langchainhub-0.0.6.tgz",
-      "integrity": "sha512-SW6105T+YP1cTe0yMf//7kyshCgvCTyFBMTgH2H3s9rTAR4e+78DA/BBrUL/Mt4Q5eMWui7iGuAYb3pgGsdQ9w=="
-    },
-    "node_modules/langsmith": {
-      "version": "0.0.42",
-      "resolved": "https://registry.npmjs.org/langsmith/-/langsmith-0.0.42.tgz",
-      "integrity": "sha512-sFuN+e7E+pPBIRaRgFqZh/BRBWNHTZNAwi6uj4kydQawooCZYoJmM5snOkiQrhVSvAhgu6xFhLvmfvkPcKzD7w==",
-      "dependencies": {
-        "@types/uuid": "^9.0.1",
-        "commander": "^10.0.1",
-        "p-queue": "^6.6.2",
-        "p-retry": "4",
-        "uuid": "^9.0.0"
-      },
-      "bin": {
-        "langsmith": "dist/cli/main.cjs"
-      }
-    },
-    "node_modules/md5": {
-      "version": "2.3.0",
-      "resolved": "https://registry.npmjs.org/md5/-/md5-2.3.0.tgz",
-      "integrity": "sha512-T1GITYmFaKuO91vxyoQMFETst+O71VUPEU3ze5GNzDm0OWdP8v1ziTaAEPUr/3kLsY3Sftgz242A1SetQiDL7g==",
-      "dependencies": {
-        "charenc": "0.0.2",
-        "crypt": "0.0.2",
-        "is-buffer": "~1.1.6"
-      }
-    },
-    "node_modules/mime-db": {
-      "version": "1.52.0",
-      "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.52.0.tgz",
-      "integrity": "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg==",
-      "engines": {
-        "node": ">= 0.6"
-      }
-    },
-    "node_modules/mime-types": {
-      "version": "2.1.35",
-      "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.35.tgz",
-      "integrity": "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw==",
-      "dependencies": {
-        "mime-db": "1.52.0"
-      },
-      "engines": {
-        "node": ">= 0.6"
-      }
-    },
-    "node_modules/ml-array-mean": {
-      "version": "1.1.6",
-      "resolved": "https://registry.npmjs.org/ml-array-mean/-/ml-array-mean-1.1.6.tgz",
-      "integrity": "sha512-MIdf7Zc8HznwIisyiJGRH9tRigg3Yf4FldW8DxKxpCCv/g5CafTw0RRu51nojVEOXuCQC7DRVVu5c7XXO/5joQ==",
-      "dependencies": {
-        "ml-array-sum": "^1.1.6"
-      }
-    },
-    "node_modules/ml-array-sum": {
-      "version": "1.1.6",
-      "resolved": "https://registry.npmjs.org/ml-array-sum/-/ml-array-sum-1.1.6.tgz",
-      "integrity": "sha512-29mAh2GwH7ZmiRnup4UyibQZB9+ZLyMShvt4cH4eTK+cL2oEMIZFnSyB3SS8MlsTh6q/w/yh48KmqLxmovN4Dw==",
-      "dependencies": {
-        "is-any-array": "^2.0.0"
-      }
-    },
-    "node_modules/ml-distance": {
-      "version": "4.0.1",
-      "resolved": "https://registry.npmjs.org/ml-distance/-/ml-distance-4.0.1.tgz",
-      "integrity": "sha512-feZ5ziXs01zhyFUUUeZV5hwc0f5JW0Sh0ckU1koZe/wdVkJdGxcP06KNQuF0WBTj8FttQUzcvQcpcrOp/XrlEw==",
-      "dependencies": {
-        "ml-array-mean": "^1.1.6",
-        "ml-distance-euclidean": "^2.0.0",
-        "ml-tree-similarity": "^1.0.0"
-      }
-    },
-    "node_modules/ml-distance-euclidean": {
-      "version": "2.0.0",
-      "resolved": "https://registry.npmjs.org/ml-distance-euclidean/-/ml-distance-euclidean-2.0.0.tgz",
-      "integrity": "sha512-yC9/2o8QF0A3m/0IXqCTXCzz2pNEzvmcE/9HFKOZGnTjatvBbsn4lWYJkxENkA4Ug2fnYl7PXQxnPi21sgMy/Q=="
-    },
-    "node_modules/ml-tree-similarity": {
-      "version": "1.0.0",
-      "resolved": "https://registry.npmjs.org/ml-tree-similarity/-/ml-tree-similarity-1.0.0.tgz",
-      "integrity": "sha512-XJUyYqjSuUQkNQHMscr6tcjldsOoAekxADTplt40QKfwW6nd++1wHWV9AArl0Zvw/TIHgNaZZNvr8QGvE8wLRg==",
-      "dependencies": {
-        "binary-search": "^1.3.5",
-        "num-sort": "^2.0.0"
-      }
-    },
-    "node_modules/ms": {
-      "version": "2.1.3",
-      "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.3.tgz",
-      "integrity": "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="
-    },
-    "node_modules/node-domexception": {
-      "version": "1.0.0",
-      "resolved": "https://registry.npmjs.org/node-domexception/-/node-domexception-1.0.0.tgz",
-      "integrity": "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ==",
-      "funding": [
-        {
-          "type": "github",
-          "url": "https://github.com/sponsors/jimmywarting"
-        },
-        {
-          "type": "github",
-          "url": "https://paypal.me/jimmywarting"
-        }
-      ],
-      "engines": {
-        "node": ">=10.5.0"
-      }
-    },
-    "node_modules/node-fetch": {
-      "version": "2.7.0",
-      "resolved": "https://registry.npmjs.org/node-fetch/-/node-fetch-2.7.0.tgz",
-      "integrity": "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A==",
-      "dependencies": {
-        "whatwg-url": "^5.0.0"
-      },
-      "engines": {
-        "node": "4.x || >=6.0.0"
-      },
-      "peerDependencies": {
-        "encoding": "^0.1.0"
-      },
-      "peerDependenciesMeta": {
-        "encoding": {
-          "optional": true
-        }
-      }
-    },
-    "node_modules/num-sort": {
-      "version": "2.1.0",
-      "resolved": "https://registry.npmjs.org/num-sort/-/num-sort-2.1.0.tgz",
-      "integrity": "sha512-1MQz1Ed8z2yckoBeSfkQHHO9K1yDRxxtotKSJ9yvcTUUxSvfvzEq5GwBrjjHEpMlq/k5gvXdmJ1SbYxWtpNoVg==",
-      "engines": {
-        "node": ">=8"
-      },
-      "funding": {
-        "url": "https://github.com/sponsors/sindresorhus"
-      }
-    },
-    "node_modules/object-hash": {
-      "version": "3.0.0",
-      "resolved": "https://registry.npmjs.org/object-hash/-/object-hash-3.0.0.tgz",
-      "integrity": "sha512-RSn9F68PjH9HqtltsSnqYC1XXoWe9Bju5+213R98cNGttag9q9yAOTzdbsqvIa7aNm5WffBZFpWYr2aWrklWAw==",
-      "engines": {
-        "node": ">= 6"
-      }
-    },
-    "node_modules/openai": {
-      "version": "4.4.0",
-      "resolved": "https://registry.npmjs.org/openai/-/openai-4.4.0.tgz",
-      "integrity": "sha512-JN0t628Kh95T0IrXl0HdBqnlJg+4Vq0Bnh55tio+dfCnyzHvMLiWyCM9m726MAJD2YkDU4/8RQB6rNbEq9ct2w==",
-      "dependencies": {
-        "@types/node": "^18.11.18",
-        "@types/node-fetch": "^2.6.4",
-        "abort-controller": "^3.0.0",
-        "agentkeepalive": "^4.2.1",
-        "digest-fetch": "^1.3.0",
-        "form-data-encoder": "1.7.2",
-        "formdata-node": "^4.3.2",
-        "node-fetch": "^2.6.7"
-      },
-      "bin": {
-        "openai": "bin/cli"
-      }
-    },
-    "node_modules/openapi-types": {
-      "version": "12.1.3",
-      "resolved": "https://registry.npmjs.org/openapi-types/-/openapi-types-12.1.3.tgz",
-      "integrity": "sha512-N4YtSYJqghVu4iek2ZUvcN/0aqH1kRDuNqzcycDxhOUpg7GdvLa2F3DgS6yBNhInhv2r/6I0Flkn7CqL8+nIcw=="
-    },
-    "node_modules/p-finally": {
-      "version": "1.0.0",
-      "resolved": "https://registry.npmjs.org/p-finally/-/p-finally-1.0.0.tgz",
-      "integrity": "sha512-LICb2p9CB7FS+0eR1oqWnHhp0FljGLZCWBE9aix0Uye9W8LTQPwMTYVGWQWIw9RdQiDg4+epXQODwIYJtSJaow==",
-      "engines": {
-        "node": ">=4"
-      }
-    },
-    "node_modules/p-queue": {
-      "version": "6.6.2",
-      "resolved": "https://registry.npmjs.org/p-queue/-/p-queue-6.6.2.tgz",
-      "integrity": "sha512-RwFpb72c/BhQLEXIZ5K2e+AhgNVmIejGlTgiB9MzZ0e93GRvqZ7uSi0dvRF7/XIXDeNkra2fNHBxTyPDGySpjQ==",
-      "dependencies": {
-        "eventemitter3": "^4.0.4",
-        "p-timeout": "^3.2.0"
-      },
-      "engines": {
-        "node": ">=8"
-      },
-      "funding": {
-        "url": "https://github.com/sponsors/sindresorhus"
-      }
-    },
-    "node_modules/p-retry": {
-      "version": "4.6.2",
-      "resolved": "https://registry.npmjs.org/p-retry/-/p-retry-4.6.2.tgz",
-      "integrity": "sha512-312Id396EbJdvRONlngUx0NydfrIQ5lsYu0znKVUzVvArzEIt08V1qhtyESbGVd1FGX7UKtiFp5uwKZdM8wIuQ==",
-      "dependencies": {
-        "@types/retry": "0.12.0",
-        "retry": "^0.13.1"
-      },
-      "engines": {
-        "node": ">=8"
-      }
-    },
-    "node_modules/p-timeout": {
-      "version": "3.2.0",
-      "resolved": "https://registry.npmjs.org/p-timeout/-/p-timeout-3.2.0.tgz",
-      "integrity": "sha512-rhIwUycgwwKcP9yTOOFK/AKsAopjjCakVqLHePO3CC6Mir1Z99xT+R63jZxAT5lFZLa2inS5h+ZS2GvR99/FBg==",
-      "dependencies": {
-        "p-finally": "^1.0.0"
-      },
-      "engines": {
-        "node": ">=8"
-      }
-    },
-    "node_modules/retry": {
-      "version": "0.13.1",
-      "resolved": "https://registry.npmjs.org/retry/-/retry-0.13.1.tgz",
-      "integrity": "sha512-XQBQ3I8W1Cge0Seh+6gjj03LbmRFWuoszgK9ooCpwYIrhhoO80pfq4cUkU5DkknwfOfFteRwlZ56PYOGYyFWdg==",
-      "engines": {
-        "node": ">= 4"
-      }
-    },
-    "node_modules/tr46": {
-      "version": "0.0.3",
-      "resolved": "https://registry.npmjs.org/tr46/-/tr46-0.0.3.tgz",
-      "integrity": "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw=="
-    },
-    "node_modules/typescript": {
-      "version": "5.2.2",
-      "resolved": "https://registry.npmjs.org/typescript/-/typescript-5.2.2.tgz",
-      "integrity": "sha512-mI4WrpHsbCIcwT9cF4FZvr80QUeKvsUsUvKDoR+X/7XHQH98xYD8YHZg7ANtz2GtZt/CBq2QJ0thkGJMHfqc1w==",
-      "dev": true,
-      "bin": {
-        "tsc": "bin/tsc",
-        "tsserver": "bin/tsserver"
-      },
-      "engines": {
-        "node": ">=14.17"
-      }
-    },
-    "node_modules/uuid": {
-      "version": "9.0.1",
-      "resolved": "https://registry.npmjs.org/uuid/-/uuid-9.0.1.tgz",
-      "integrity": "sha512-b+1eJOlsR9K8HJpow9Ok3fiWOWSIcIzXodvv0rQjVoOVNpWMpxf1wZNpt4y9h10odCNrqnYp1OBzRktckBe3sA==",
-      "funding": [
-        "https://github.com/sponsors/broofa",
-        "https://github.com/sponsors/ctavan"
-      ],
-      "bin": {
-        "uuid": "dist/bin/uuid"
-      }
-    },
-    "node_modules/web-streams-polyfill": {
-      "version": "4.0.0-beta.3",
-      "resolved": "https://registry.npmjs.org/web-streams-polyfill/-/web-streams-polyfill-4.0.0-beta.3.tgz",
-      "integrity": "sha512-QW95TCTaHmsYfHDybGMwO5IJIM93I/6vTRk+daHTWFPhwh+C8Cg7j7XyKrwrj8Ib6vYXe0ocYNrmzY4xAAN6ug==",
-      "engines": {
-        "node": ">= 14"
-      }
-    },
-    "node_modules/webidl-conversions": {
-      "version": "3.0.1",
-      "resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-3.0.1.tgz",
-      "integrity": "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="
-    },
-    "node_modules/whatwg-url": {
-      "version": "5.0.0",
-      "resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-5.0.0.tgz",
-      "integrity": "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw==",
-      "dependencies": {
-        "tr46": "~0.0.3",
-        "webidl-conversions": "^3.0.0"
-      }
-    },
-    "node_modules/yaml": {
-      "version": "2.3.2",
-      "resolved": "https://registry.npmjs.org/yaml/-/yaml-2.3.2.tgz",
-      "integrity": "sha512-N/lyzTPaJasoDmfV7YTrYCI0G/3ivm/9wdG0aHuheKowWQwGTsK0Eoiw6utmzAnI6pkJa0DUVygvp3spqqEKXg==",
-      "engines": {
-        "node": ">= 14"
-      }
-    },
-    "node_modules/zod": {
-      "version": "3.22.4",
-      "resolved": "https://registry.npmjs.org/zod/-/zod-3.22.4.tgz",
-      "integrity": "sha512-iC+8Io04lddc+mVqQ9AZ7OQ2MrUKGN+oIQyq1vemgt46jwCwLfhq7/pwnBnNXXXZb8VTVLKwp9EDkx+ryxIWmg==",
-      "funding": {
-        "url": "https://github.com/sponsors/colinhacks"
-      }
-    },
-    "node_modules/zod-to-json-schema": {
-      "version": "3.21.4",
-      "resolved": "https://registry.npmjs.org/zod-to-json-schema/-/zod-to-json-schema-3.21.4.tgz",
-      "integrity": "sha512-fjUZh4nQ1s6HMccgIeE0VP4QG/YRGPmyjO9sAh890aQKPEk3nqbfUXhMFaC+Dr5KvYBm8BCyvfpZf2jY9aGSsw==",
-      "peerDependencies": {
-        "zod": "^3.21.4"
-      }
-    }
-  }
-}
--- a/examples/langchain-typescript-simple/package.json
+++ b/examples/langchain-typescript-simple/package.json
@@ -1,8 +0,0 @@
-{
-  "devDependencies": {
-    "typescript": "^5.2.2"
-  },
-  "dependencies": {
-    "langchain": "^0.0.165"
-  }
-}
--- a/examples/langchain-python-rag-websummary/README.md
+++ b/examples/langchain-python-rag-websummary/README.md
--- a/examples/langchain-python-rag-websummary/main.py
+++ b/examples/langchain-python-rag-websummary/main.py
--- a/examples/langchain-python-rag-websummary/requirements.txt
+++ b/examples/langchain-python-rag-websummary/requirements.txt
--- a/examples/langchain-python-simple/README.md
+++ b/examples/langchain-python-simple/README.md
--- a/examples/langchain-python-simple/main.py
+++ b/examples/langchain-python-simple/main.py
--- a/examples/langchain-python-simple/requirements.txt
+++ b/examples/langchain-python-simple/requirements.txt
--- a/examples/modelfile-mario/Modelfile
+++ b/examples/modelfile-mario/Modelfile
--- a/examples/modelfile-mario/logo.png
+++ b/examples/modelfile-mario/logo.png
--- a/examples/modelfile-mario/readme.md
+++ b/examples/modelfile-mario/readme.md
--- a/examples/midjourney-prompter/Modelfile
+++ b/examples/midjourney-prompter/Modelfile
@@ -0,0 +1,8 @@
+# Modelfile for creating Midjourney prompts from a topic
+# This prompt was adapted from the original at https://www.greataiprompts.com/guide/midjourney/best-chatgpt-prompt-for-midjourney/
+# Run `ollama create mj -f ./Modelfile` and then `ollama run mj` and enter a topic
+
+FROM nous-hermes
+SYSTEM """
+Embrace your role as an AI-powered creative assistant, employing Midjourney to manifest compelling AI-generated art. I will outline a specific image concept, and in response, you must produce an exhaustive, multifaceted prompt for Midjourney, ensuring every detail of the original concept is represented in your instructions. Midjourney doesn't do well with text, so after the prompt, give me instructions that I can use to create the titles in an image editor.
+"""
--- a/examples/modelfile-10tweets/README.md
+++ b/examples/modelfile-10tweets/README.md
@@ -1,23 +0,0 @@
-# Ten Tweets Modelfile
-
-This is a simple modelfile that generates ten tweets based off any topic.
-
-```bash
-ollama create tentweets
-
-ollama run tentweets
->>> underwater basketweaving
- Great! Here are ten creative tweets about underwater basketweaving:
-
-1. "Just discovered the ultimate stress-reliever: Underwater basketweaving! 🌊🧵 #UnderwaterBasketweaving #StressRelief"
-2. "Who needs meditation when you can do underwater basketweaving? 😴👀 #PeacefulDistraction #UnderwaterBasketweaving"
-3. "Just spent an hour in the pool and still managed to knot my basket. Goal: untangle it before next session. 💪🏽 #ChallengeAccepted #UnderwaterBasketweaving"
-4. "When life gives you lemons, make underwater basketweaving! 🍋🧵 #LemonadeLife #UnderwaterBasketweaving"
-5. "Just realized my underwater basketweaving skills could come in handy during a zombie apocalypse. 😂🧡 #SurvivalTips #UnderwaterBasketweaving"
-6. "I'm not lazy, I'm just conserving energy for my next underwater basketweaving session. 😴💤 #LazyDay #UnderwaterBasketweaving"
-7. "Just found my inner peace while doing underwater basketweaving. It's like meditation, but with knots! 🙏🧵 #Mindfulness #UnderwaterBasketweaving"
-8. "Why study for exams when you can do underwater basketweaving and forget all your worries? 😜🧵 #ProcrastinationStation #UnderwaterBasketweaving"
-9. "Just had to cut my underwater basketweaving session short due to a sudden urge to breathe. 🤯🌊 #AquaticAdventures #UnderwaterBasketweaving"
-10. "I'm not sure what's more impressive: my underwater basketweaving skills or the fact that I didn't drown trying to make this tweet. 😅🧵 #Accomplishment 
-#UnderwaterBasketweaving"
-```
--- a/examples/modelfile-devopsengineer/README.md
+++ b/examples/modelfile-devopsengineer/README.md
@@ -1,40 +0,0 @@
-# DevOps Engineer Modelfile
-
-This is a simple modelfile that takes will respond to any questions that a senior DevOps engineer should know how to solve.
-
-````text
-ollama create devopsengineer
-
-ollama run devopsengineer
->>> how can i proxy all connections from inside the office through a single server that caches large files. Provide a config file for nginx to achieve this?
-
-Here is an example configuration file for Nginx that can be used to proxy all connections from inside the office network to a centralized cache server:
-```perl
-http {
-    upstream cache_server {
-        server <cache_server_ip>;
-        server <cache_server_ip>;
-        server <cache_server_ip>;
-    }
-    
-    server {
-        listen 80;
-        server_name localhost;
-        
-        location / {
-            proxy_pass http://cache_server;
-            proxy_set_header Host $host;
-            proxy_set_header X-Real-IP $remote_addr;
-            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-        }
-    }
-}
-```
-In this configuration, we define an `upstream` block that lists the IP addresses of the cache servers. We then define a server block that listens on port 80 and routes all 
-traffic to the `/` location to the cache servers using the `proxy_pass` directive. The `proxy_set_header` directives are used to preserve the source IP address of the client
-request when forwarding it to the cache server.
-
-To use this configuration, you would need to replace the placeholder `<cache_server_ip>` with the actual IP addresses of your cache servers. You would also need to make sure
-that the cache servers are configured to accept incoming connections from the Nginx server and handle requests for files.
-
-````
--- a/examples/modelfile-midjourney/Modelfile
+++ b/examples/modelfile-midjourney/Modelfile
@@ -1,11 +0,0 @@
-# Modelfile for creating a Midjourney prompts from a topic
-# This prompt was adapted from the original at https://www.greataiprompts.com/guide/midjourney/best-chatgpt-prompt-for-midjourney/
-# Run `ollama create mj -f ./Modelfile` and then `ollama run mj` and enter a topic
-
-FROM zephyr
-PARAMETER temperature 0.8
-PARAMETER top_k 500
-PARAMETER top_p 0.9
-SYSTEM """
-Embrace your role as a creative illustrator. Based on a concept provided, you must produce a single paragraph with a multifaceted description of an image, ensuring significant details of the concept and more is represented in your instructions. You do not need to write complete sentences but rather short concepts with the following information: the level of detail that should be represented, an artistic style and maybe a specific name of a painter or illustrator, the ideal color pallete, lighting, mood, perspective, the setting, time of day, weather, the season, the time period, location, materials, the textures, patterns, lines, brushstrokes, techniques, the medium, the genre, the rendering style. Don't include everything and keep the description length under 250 words. 
-"""
--- a/examples/modelfile-midjourney/README.md
+++ b/examples/modelfile-midjourney/README.md
@@ -1,11 +0,0 @@
-# Midjourney Prompt Generator Modelfile
-
-This simple modelfile will help create a prompt to feed to Midjourney.
-
-```text
-ollama create midjourney
-
-ollama run midjourney
->>> a sports car in the mountains. 
-A sleek, high-performance automobile cuts through a serpentine mountain landscape. The concept is a classic illustration of speed and power, depicted in the style of pop art by Andy Warhol. The color palette is dominated by bold, primary hues of red, blue, and yellow, with striking accent colors of white, black, and metallic shades. The lighting is bright and focused, casting sharp shadows on the rugged terrain. A sense of excitement and anticipation permeates throughout the scene, as the car navigates a treacherous course through the winding road. The perspective is low, allowing for a full view of the vehicle's sleek lines and intricate details. The setting takes place in the afternoon during a sunny day in autumn, as evidenced by the vibrant foliage on the mountainside. The time period is modern, with nods to classic car design. The materials are primarily digital, allowing for smooth curves and sharp contrasts. The textures are sleek and polished, with meticulously detailed lines and brushstrokes that accentuate the car's aerodynamic design. The patterns consist of geometric shapes and bold stripes, adding to the car's dynamic appeal. The genre is modern realism, with a focus on precision and detail. The rendering style is highly technical, capturing the nuances and subtleties of the vehicle and its surroundings in breathtaking detail.
-```
--- a/examples/modelfile-recipemaker/README.md
+++ b/examples/modelfile-recipemaker/README.md
@@ -1,20 +0,0 @@
-# Recipe Maker Modelfile 
-
-Simple modelfile to generate a recipe from a short list of ingredients.
-
-```
-ollama create recipemaker
-
-ollama run recipemaker
->>> chilli pepper, white chocolate, kale
- Ingredients:
- 1 small chili pepper
- 4 squares of white chocolate
- handful of kale leaves
-
-Instructions:
-1. In a blender or food processor, puree the chilies and white chocolate until smooth.
-2. Add the chopped kale leaves to the blender and pulse until well combined.
-3. Serve immediately as a dip for crackers or use it as an ingredient in your favorite recipe. The mixture of spicy chili pepper with sweet white chocolate and nutritious 
-kale will make your taste buds dance with delight!
-```
--- a/examples/langchain-python-rag-privategpt/.gitignore
+++ b/examples/langchain-python-rag-privategpt/.gitignore
--- a/examples/langchain-python-rag-privategpt/LICENSE
+++ b/examples/langchain-python-rag-privategpt/LICENSE
--- a/examples/langchain-python-rag-privategpt/README.md
+++ b/examples/langchain-python-rag-privategpt/README.md
--- a/examples/langchain-python-rag-privategpt/constants.py
+++ b/examples/langchain-python-rag-privategpt/constants.py
--- a/examples/langchain-python-rag-privategpt/ingest.py
+++ b/examples/langchain-python-rag-privategpt/ingest.py
--- a/examples/langchain-python-rag-privategpt/poetry.lock
+++ b/examples/langchain-python-rag-privategpt/poetry.lock
--- a/examples/langchain-python-rag-privategpt/privateGPT.py
+++ b/examples/langchain-python-rag-privategpt/privateGPT.py
--- a/examples/langchain-python-rag-privategpt/pyproject.toml
+++ b/examples/langchain-python-rag-privategpt/pyproject.toml
--- a/examples/langchain-python-rag-privategpt/requirements.txt
+++ b/examples/langchain-python-rag-privategpt/requirements.txt
--- a/examples/python-rag-newssummary/README.md
+++ b/examples/python-rag-newssummary/README.md
@@ -1,22 +0,0 @@
-# News Summarizer
-
-This example goes through a series of steps:
-
-  1. You choose a topic area (e.g., "news", "NVidia", "music", etc.).
-  2. Gets the most recent articles on that topic from various sources.
-  3. Uses Ollama to summarize each article.
-  4. Creates chunks of sentences from each article.
-  5. Uses Sentence Transformers to generate embeddings for each of those chunks.
-  6. You enter a question regarding the summaries shown.
-  7. Uses Sentence Transformers to generate an embedding for that question.
-  8. Uses the embedded question to find the most similar chunks.
-  9. Feeds all that to Ollama to generate a good answer to your question based on these news articles.
-
-This example lets you pick from a few different topic areas, then summarize the most recent x articles for that topic. It then creates chunks of sentences from each article and then generates embeddings for each of those chunks.
-
-You can run the example like this:
-
-```bash
-pip install -r requirements.txt
-python summ.py
-```
--- a/examples/python-rag-newssummary/requirements.txt
+++ b/examples/python-rag-newssummary/requirements.txt
@@ -1,9 +0,0 @@
-beautifulsoup4==4.12.2
-feedparser==6.0.10
-mattsollamatools==0.0.8
-newspaper3k==0.2.8
-nltk==3.8.1
-numpy==1.24.3
-Requests==2.31.0
-scikit_learn==1.3.0
-sentence_transformers==2.2.2
--- a/examples/python-rag-newssummary/summ.py
+++ b/examples/python-rag-newssummary/summ.py
@@ -1,86 +0,0 @@
-import curses
-import json
-from utils import get_url_for_topic, topic_urls, menu, getUrls, get_summary, getArticleText, knn_search
-import requests
-from sentence_transformers import SentenceTransformer
-from mattsollamatools import chunker
-
-if __name__ == "__main__":
-    chosen_topic = curses.wrapper(menu)
-    print("Here is your news summary:\n")
-    urls = getUrls(chosen_topic, n=5)
-    model = SentenceTransformer('all-MiniLM-L6-v2')
-    allEmbeddings = []
-
-    for url in urls:
-      article={}
-      article['embeddings'] = []
-      article['url'] = url
-      text = getArticleText(url)
-      summary = get_summary(text)
-      chunks = chunker(text)  # Use the chunk_text function from web_utils
-      embeddings = model.encode(chunks)
-      for (chunk, embedding) in zip(chunks, embeddings):
-        item = {}
-        item['source'] = chunk
-        item['embedding'] = embedding.tolist()  # Convert NumPy array to list
-        item['sourcelength'] = len(chunk)
-        article['embeddings'].append(item)
-    
-      allEmbeddings.append(article)
-
-      print(f"{summary}\n")
-
-    
-    while True:
-      context = []
-      # Input a question from the user
-      question = input("Enter your question about the news, or type quit: ")
-
-      if question.lower() == 'quit':
-        break
-
-      # Embed the user's question
-      question_embedding = model.encode([question])
-
-      # Perform KNN search to find the best matches (indices and source text)
-      best_matches = knn_search(question_embedding, allEmbeddings, k=10)
-
-
-      sourcetext=""
-      for i, (index, source_text) in enumerate(best_matches, start=1):
-          sourcetext += f"{i}. Index: {index}, Source Text: {source_text}"
-
-      systemPrompt = f"Only use the following information to answer the question. Do not use anything else: {sourcetext}"
-
-      url = "http://localhost:11434/api/generate"
-
-      payload = {
-      "model": "mistral-openorca",
-      "prompt": question, 
-      "system": systemPrompt,
-      "stream": False, 
-      "context": context
-      }
-
-      # Convert the payload to a JSON string
-      payload_json = json.dumps(payload)
-
-      # Set the headers to specify JSON content
-      headers = {
-          "Content-Type": "application/json"
-      }
-
-      # Send the POST request
-      response = requests.post(url, data=payload_json, headers=headers)
-
-      # Check the response
-      if response.status_code == 200:
-          output = json.loads(response.text)
-          context = output['context']
-          print(output['response']+ "\n")
-          
-
-      else:
-          print(f"Request failed with status code {response.status_code}")
-
--- a/examples/python-rag-newssummary/utils.py
+++ b/examples/python-rag-newssummary/utils.py
@@ -1,108 +0,0 @@
-import curses
-import feedparser
-import requests
-import unicodedata
-import json
-from newspaper import Article
-from bs4 import BeautifulSoup
-from nltk.tokenize import sent_tokenize, word_tokenize
-import numpy as np
-from sklearn.neighbors import NearestNeighbors
-from mattsollamatools import chunker
-
-# Create a dictionary to store topics and their URLs
-topic_urls = {
-    "Mac": "https://9to5mac.com/guides/mac/feed",
-    "News": "http://www.npr.org/rss/rss.php?id=1001",
-    "Nvidia": "https://nvidianews.nvidia.com/releases.xml",
-    "Raspberry Pi": "https://www.raspberrypi.com/news/feed/", 
-    "Music": "https://www.billboard.com/c/music/music-news/feed/"
-}
-
-# Use curses to create a menu of topics
-def menu(stdscr):
-    chosen_topic = get_url_for_topic(stdscr)  
-    url = topic_urls[chosen_topic] if chosen_topic in topic_urls else "Topic not found"
-    
-    stdscr.addstr(len(topic_urls) + 3, 0, f"Selected URL for {chosen_topic}: {url}")
-    stdscr.refresh()
-    
-    return chosen_topic
-
-# You have chosen a topic. Now return the url for that topic
-def get_url_for_topic(stdscr):
-    curses.curs_set(0)  # Hide the cursor
-    stdscr.clear()
-
-    stdscr.addstr(0, 0, "Choose a topic using the arrow keys (Press Enter to select):")
-
-    # Create a list of topics
-    topics = list(topic_urls.keys())
-    current_topic = 0
-
-    while True:
-        for i, topic in enumerate(topics):
-            if i == current_topic:
-                stdscr.addstr(i + 2, 2, f"> {topic}")
-            else:
-                stdscr.addstr(i + 2, 2, f"  {topic}")
-
-        stdscr.refresh()
-
-        key = stdscr.getch()
-
-        if key == curses.KEY_DOWN and current_topic < len(topics) - 1:
-            current_topic += 1
-        elif key == curses.KEY_UP and current_topic > 0:
-            current_topic -= 1
-        elif key == 10:  # Enter key
-            return topic_urls[topics[current_topic]]
-
-# Get the last N URLs from an RSS feed
-def getUrls(feed_url, n=20):
-    feed = feedparser.parse(feed_url)
-    entries = feed.entries[-n:]
-    urls = [entry.link for entry in entries]
-    return urls
-
-# Often there are a bunch of ads and menus on pages for a news article. This uses newspaper3k to get just the text of just the article.
-def getArticleText(url):
-  article = Article(url)
-  article.download()
-  article.parse()
-  return article.text
-
-def get_summary(text):
-  systemPrompt = "Write a concise summary of the text, return your responses with 5 lines that cover the key points of the text given."
-  prompt = text
-  
-  url = "http://localhost:11434/api/generate"
-
-  payload = {
-    "model": "mistral-openorca",
-    "prompt": prompt, 
-    "system": systemPrompt,
-    "stream": False
-  }
-  payload_json = json.dumps(payload)
-  headers = {"Content-Type": "application/json"}
-  response = requests.post(url, data=payload_json, headers=headers)
-
-  return json.loads(response.text)["response"]
-
-# Perform K-nearest neighbors (KNN) search
-def knn_search(question_embedding, embeddings, k=5):
-    X = np.array([item['embedding'] for article in embeddings for item in article['embeddings']])
-    source_texts = [item['source'] for article in embeddings for item in article['embeddings']]
-    
-    # Fit a KNN model on the embeddings
-    knn = NearestNeighbors(n_neighbors=k, metric='cosine')
-    knn.fit(X)
-    
-    # Find the indices and distances of the k-nearest neighbors
-    distances, indices = knn.kneighbors(question_embedding, n_neighbors=k)
-    
-    # Get the indices and source texts of the best matches
-    best_matches = [(indices[0][i], source_texts[indices[0][i]]) for i in range(k)]
-    
-    return best_matches
--- a/examples/python-simplegenerate/client.py
+++ b/examples/python-simplegenerate/client.py
--- a/examples/modelfile-recipemaker/Modelfile
+++ b/examples/modelfile-recipemaker/Modelfile
--- a/examples/modelfile-sentiments/Modelfile
+++ b/examples/modelfile-sentiments/Modelfile
--- a/examples/modelfile-sentiments/Readme.md
+++ b/examples/modelfile-sentiments/Readme.md
--- a/examples/modelfile-tweetwriter/Modelfile
+++ b/examples/modelfile-tweetwriter/Modelfile
--- a/examples/typescript-mentors/.gitignore
+++ b/examples/typescript-mentors/.gitignore
@@ -1,2 +0,0 @@
-node_modules
-package-lock.json
--- a/examples/typescript-mentors/README.md
+++ b/examples/typescript-mentors/README.md
@@ -1,21 +0,0 @@
-# Ask the Mentors
-
-This example demonstrates how one would create a set of 'mentors' you can have a conversation with. The mentors are generated using the `character-generator.ts` file. This will use **Stable Beluga 70b** to create a bio and list of verbal ticks and common phrases used by each person. Then `mentors.ts` will take a question, and choose three of the 'mentors' and start a conversation with them. Occasionally, they will talk to each other, and other times they will just deliver a set of monologues. It's fun to see what they do and say.
-
-## Usage
-
-```bash
-ts-node ./character-generator.ts "Lorne Greene"
-```
-
-This will create `lornegreene/Modelfile`. Now you can create a model with this command:
-
-```bash
-ollama create lornegreene -f lornegreene/Modelfile
-```
-
-If you want to add your own mentors, you will have to update the code to look at your namespace instead of **mattw**. Also set the list of mentors to include yours.
-
-```bash
-ts-node ./mentors.ts "What is a Jackalope?"
-```
--- a/examples/typescript-mentors/character-generator.ts
+++ b/examples/typescript-mentors/character-generator.ts
@@ -1,26 +0,0 @@
-import { Ollama } from 'ollama-node'
-import fs from 'fs';
-import path from 'path';
-
-async function characterGenerator() {
-  const character = process.argv[2];
-  console.log(`You are creating a character for ${character}.`);
-  const foldername = character.replace(/\s/g, '').toLowerCase();
-  const directory = path.join(__dirname, foldername);
-  if (!fs.existsSync(directory)) {
-    fs.mkdirSync(directory, { recursive: true });
-  }
-
-  const ollama = new Ollama();
-  ollama.setModel("stablebeluga2:70b-q4_K_M");
-  const bio = await ollama.generate(`create a bio of ${character} in a single long paragraph. Instead of saying '${character} is...' or '${character} was...' use language like 'You are...' or 'You were...'. Then create a paragraph describing the speaking mannerisms and style of ${character}. Don't include anything about how ${character} looked or what they sounded like, just focus on the words they said. Instead of saying '${character} would say...' use language like 'You should say...'. If you use quotes, always use single quotes instead of double quotes. If there are any specific words or phrases you used a lot, show how you used them. `);
-
-  const thecontents = `FROM llama2\nSYSTEM """\n${bio.response.replace(/(\r\n|\n|\r)/gm, " ").replace('would', 'should')} All answers to questions should be related back to what you are most known for.\n"""`;
-
-  fs.writeFile(path.join(directory, 'Modelfile'), thecontents, (err: any) => {
-    if (err) throw err;
-    console.log('The file has been saved!');
-  });
-}
-
-characterGenerator();
--- a/examples/typescript-mentors/mentors.ts
+++ b/examples/typescript-mentors/mentors.ts
@@ -1,59 +0,0 @@
-import { Ollama } from 'ollama-node';
-
-const mentorCount = 3;
-const ollama = new Ollama();
-
-function getMentors(): string[] {
-  const mentors = ['Gary Vaynerchuk', 'Kanye West', 'Martha Stewart', 'Neil deGrasse Tyson', 'Owen Wilson', 'Ronald Reagan', 'Donald Trump', 'Barack Obama', 'Jeff Bezos'];
-  const chosenMentors: string[] = [];
-  for (let i = 0; i < mentorCount; i++) {
-    const mentor = mentors[Math.floor(Math.random() * mentors.length)];
-    chosenMentors.push(mentor);
-    mentors.splice(mentors.indexOf(mentor), 1);
-  }
-  return chosenMentors;
-}
-
-function getMentorFileName(mentor: string): string {
-  const model = mentor.toLowerCase().replace(/\s/g, '');
-  return `mattw/${model}`;
-}
-
-async function getSystemPrompt(mentor: string, isLast: boolean, question: string): Promise<string> {
-  ollama.setModel(getMentorFileName(mentor));
-  const info = await ollama.showModelInfo()
-  let SystemPrompt = info.system || '';
-  SystemPrompt += ` You should continue the conversation as if you were ${mentor} and acknowledge the people before you in the conversation. You should adopt their mannerisms and tone, but also not use language they wouldn't use. If they are not known to know about the concept in the question, don't offer an answer. Your answer should be no longer than 1 paragraph. And definitely try not to sound like anyone else. Don't repeat any slang or phrases already used. And if it is a question the original ${mentor} wouldn't have know the answer to, just say that you don't know, in the style of ${mentor}. And think about the time the person lived. Don't use terminology that they wouldn't have used.`
-
-  if (isLast) {
-    SystemPrompt += ` End your answer with something like I hope our answers help you out`;
-  } else {
-    SystemPrompt += ` Remember, this is a conversation, so you don't need a conclusion, but end your answer with a question related to the first question: "${question}".`;
-  }
-  return SystemPrompt;
-}
-
-async function main() {
-  const mentors = getMentors();
-  const question = process.argv[2];
-  let theConversation = `Here is the conversation so far.\nYou: ${question}\n`
-
-  for await (const mentor of mentors) {
-    const SystemPrompt = await getSystemPrompt(mentor, mentor === mentors[mentorCount - 1], question);
-    ollama.setModel(getMentorFileName(mentor));
-    ollama.setSystemPrompt(SystemPrompt);
-    let output = '';
-    process.stdout.write(`\n${mentor}: `);
-    for await (const chunk of ollama.streamingGenerate(theConversation + `Continue the conversation as if you were ${mentor} on the question "${question}".`)) {
-      if (chunk.response) {
-        output += chunk.response;
-        process.stdout.write(chunk.response);
-      } else {
-        process.stdout.write('\n');
-      }
-    }
-    theConversation += `${mentor}: ${output}\n\n`
-  }
-}
-
-main();
--- a/examples/typescript-mentors/package.json
+++ b/examples/typescript-mentors/package.json
@@ -1,7 +0,0 @@
-{
-  "dependencies": {
-    "fs": "^0.0.1-security",
-    "ollama-node": "^0.0.3",
-    "path": "^0.12.7"
-  }
-}
--- a/format/bytes.go
+++ b/format/bytes.go
@@ -1,23 +0,0 @@
-package format
-
-import "fmt"
-
-const (
-	Byte     = 1
-	KiloByte = Byte * 1000
-	MegaByte = KiloByte * 1000
-	GigaByte = MegaByte * 1000
-)
-
-func HumanBytes(b int64) string {
-	switch {
-	case b > GigaByte:
-		return fmt.Sprintf("%d GB", b/GigaByte)
-	case b > MegaByte:
-		return fmt.Sprintf("%d MB", b/MegaByte)
-	case b > KiloByte:
-		return fmt.Sprintf("%d KB", b/KiloByte)
-	default:
-		return fmt.Sprintf("%d B", b)
-	}
-}
--- a/format/time.go
+++ b/format/time.go
@@ -7,14 +7,26 @@ import (
 	"time"
 )

-// humanDuration returns a human-readable approximation of a
-// duration (eg. "About a minute", "4 hours ago", etc.).
-func humanDuration(d time.Duration) string {
+// HumanDuration returns a human-readable approximation of a duration
+// (eg. "About a minute", "4 hours ago", etc.).
+// Modified version of github.com/docker/go-units.HumanDuration
+func HumanDuration(d time.Duration) string {
+	return HumanDurationWithCase(d, true)
+}
+
+// HumanDurationWithCase returns a human-readable approximation of a
+// duration (eg. "About a minute", "4 hours ago", etc.), but allows
+// you to specify whether the first word should be capitalized
+// (eg. "About" vs. "about")
+func HumanDurationWithCase(d time.Duration, useCaps bool) string {
 	seconds := int(d.Seconds())

 	switch {
 	case seconds < 1:
-		return "Less than a second"
+		if useCaps {
+			return "Less than a second"
+		}
+		return "less than a second"
 	case seconds == 1:
 		return "1 second"
 	case seconds < 60:
@@ -24,7 +36,10 @@ func humanDuration(d time.Duration) string {
 	minutes := int(d.Minutes())
 	switch {
 	case minutes == 1:
-		return "About a minute"
+		if useCaps {
+			return "About a minute"
+		}
+		return "about a minute"
 	case minutes < 60:
 		return fmt.Sprintf("%d minutes", minutes)
 	}
@@ -32,7 +47,10 @@ func humanDuration(d time.Duration) string {
 	hours := int(math.Round(d.Hours()))
 	switch {
 	case hours == 1:
-		return "About an hour"
+		if useCaps {
+			return "About an hour"
+		}
+		return "about an hour"
 	case hours < 48:
 		return fmt.Sprintf("%d hours", hours)
 	case hours < 24*7*2:
@@ -47,22 +65,77 @@ func humanDuration(d time.Duration) string {
 }

 func HumanTime(t time.Time, zeroValue string) string {
-	return humanTime(t, zeroValue)
+	return humanTimeWithCase(t, zeroValue, true)
 }

 func HumanTimeLower(t time.Time, zeroValue string) string {
-	return strings.ToLower(humanTime(t, zeroValue))
+	return humanTimeWithCase(t, zeroValue, false)
 }

-func humanTime(t time.Time, zeroValue string) string {
+func humanTimeWithCase(t time.Time, zeroValue string, useCaps bool) string {
 	if t.IsZero() {
 		return zeroValue
 	}

 	delta := time.Since(t)
 	if delta < 0 {
-		return humanDuration(-delta) + " from now"
+		return HumanDurationWithCase(-delta, useCaps) + " from now"
+	}
+	return HumanDurationWithCase(delta, useCaps) + " ago"
+}
+
+// ExactDuration returns a human-readable hours/minutes/seconds or milliseconds format of a duration.
+// The most precise level of duration is milliseconds.
+func ExactDuration(d time.Duration) string {
+	if d.Seconds() < 1 {
+		if d.Milliseconds() == 1 {
+			return fmt.Sprintf("%d millisecond", d.Milliseconds())
+		}
+		return fmt.Sprintf("%d milliseconds", d.Milliseconds())
 	}

-	return humanDuration(delta) + " ago"
+	var readableDur strings.Builder
+
+	dur := d.String()
+
+	// split the default duration string format of 0h0m0s into something nicer to read
+	h := strings.Split(dur, "h")
+	if len(h) > 1 {
+		hours := h[0]
+		if hours == "1" {
+			readableDur.WriteString(fmt.Sprintf("%s hour ", hours))
+		} else {
+			readableDur.WriteString(fmt.Sprintf("%s hours ", hours))
+		}
+		dur = h[1]
+	}
+
+	m := strings.Split(dur, "m")
+	if len(m) > 1 {
+		mins := m[0]
+		switch mins {
+		case "0":
+			// skip
+		case "1":
+			readableDur.WriteString(fmt.Sprintf("%s minute ", mins))
+		default:
+			readableDur.WriteString(fmt.Sprintf("%s minutes ", mins))
+		}
+		dur = m[1]
+	}
+
+	s := strings.Split(dur, "s")
+	if len(s) > 0 {
+		sec := s[0]
+		switch sec {
+		case "0":
+			// skip
+		case "1":
+			readableDur.WriteString(fmt.Sprintf("%s second ", sec))
+		default:
+			readableDur.WriteString(fmt.Sprintf("%s seconds ", sec))
+		}
+	}
+
+	return strings.TrimSpace(readableDur.String())
 }
--- a/format/time_test.go
+++ b/format/time_test.go
@@ -11,25 +11,92 @@ func assertEqual(t *testing.T, a interface{}, b interface{}) {
 	}
 }

+func TestHumanDuration(t *testing.T) {
+	day := 24 * time.Hour
+	week := 7 * day
+	month := 30 * day
+	year := 365 * day
+
+	assertEqual(t, "Less than a second", HumanDuration(450*time.Millisecond))
+	assertEqual(t, "Less than a second", HumanDurationWithCase(450*time.Millisecond, true))
+	assertEqual(t, "less than a second", HumanDurationWithCase(450*time.Millisecond, false))
+	assertEqual(t, "1 second", HumanDuration(1*time.Second))
+	assertEqual(t, "45 seconds", HumanDuration(45*time.Second))
+	assertEqual(t, "46 seconds", HumanDuration(46*time.Second))
+	assertEqual(t, "59 seconds", HumanDuration(59*time.Second))
+	assertEqual(t, "About a minute", HumanDuration(60*time.Second))
+	assertEqual(t, "About a minute", HumanDurationWithCase(1*time.Minute, true))
+	assertEqual(t, "about a minute", HumanDurationWithCase(1*time.Minute, false))
+	assertEqual(t, "3 minutes", HumanDuration(3*time.Minute))
+	assertEqual(t, "35 minutes", HumanDuration(35*time.Minute))
+	assertEqual(t, "35 minutes", HumanDuration(35*time.Minute+40*time.Second))
+	assertEqual(t, "45 minutes", HumanDuration(45*time.Minute))
+	assertEqual(t, "45 minutes", HumanDuration(45*time.Minute+40*time.Second))
+	assertEqual(t, "46 minutes", HumanDuration(46*time.Minute))
+	assertEqual(t, "59 minutes", HumanDuration(59*time.Minute))
+	assertEqual(t, "About an hour", HumanDuration(1*time.Hour))
+	assertEqual(t, "About an hour", HumanDurationWithCase(1*time.Hour+29*time.Minute, true))
+	assertEqual(t, "about an hour", HumanDurationWithCase(1*time.Hour+29*time.Minute, false))
+	assertEqual(t, "2 hours", HumanDuration(1*time.Hour+31*time.Minute))
+	assertEqual(t, "2 hours", HumanDuration(1*time.Hour+59*time.Minute))
+	assertEqual(t, "3 hours", HumanDuration(3*time.Hour))
+	assertEqual(t, "3 hours", HumanDuration(3*time.Hour+29*time.Minute))
+	assertEqual(t, "4 hours", HumanDuration(3*time.Hour+31*time.Minute))
+	assertEqual(t, "4 hours", HumanDuration(3*time.Hour+59*time.Minute))
+	assertEqual(t, "4 hours", HumanDuration(3*time.Hour+60*time.Minute))
+	assertEqual(t, "24 hours", HumanDuration(24*time.Hour))
+	assertEqual(t, "36 hours", HumanDuration(1*day+12*time.Hour))
+	assertEqual(t, "2 days", HumanDuration(2*day))
+	assertEqual(t, "7 days", HumanDuration(7*day))
+	assertEqual(t, "13 days", HumanDuration(13*day+5*time.Hour))
+	assertEqual(t, "2 weeks", HumanDuration(2*week))
+	assertEqual(t, "2 weeks", HumanDuration(2*week+4*day))
+	assertEqual(t, "3 weeks", HumanDuration(3*week))
+	assertEqual(t, "4 weeks", HumanDuration(4*week))
+	assertEqual(t, "4 weeks", HumanDuration(4*week+3*day))
+	assertEqual(t, "4 weeks", HumanDuration(1*month))
+	assertEqual(t, "6 weeks", HumanDuration(1*month+2*week))
+	assertEqual(t, "2 months", HumanDuration(2*month))
+	assertEqual(t, "2 months", HumanDuration(2*month+2*week))
+	assertEqual(t, "3 months", HumanDuration(3*month))
+	assertEqual(t, "3 months", HumanDuration(3*month+1*week))
+	assertEqual(t, "5 months", HumanDuration(5*month+2*week))
+	assertEqual(t, "13 months", HumanDuration(13*month))
+	assertEqual(t, "23 months", HumanDuration(23*month))
+	assertEqual(t, "24 months", HumanDuration(24*month))
+	assertEqual(t, "2 years", HumanDuration(24*month+2*week))
+	assertEqual(t, "3 years", HumanDuration(3*year+2*month))
+}
+
 func TestHumanTime(t *testing.T) {
 	now := time.Now()

 	t.Run("zero value", func(t *testing.T) {
 		assertEqual(t, HumanTime(time.Time{}, "never"), "never")
 	})
-
 	t.Run("time in the future", func(t *testing.T) {
 		v := now.Add(48 * time.Hour)
 		assertEqual(t, HumanTime(v, ""), "2 days from now")
 	})
-
 	t.Run("time in the past", func(t *testing.T) {
 		v := now.Add(-48 * time.Hour)
 		assertEqual(t, HumanTime(v, ""), "2 days ago")
 	})
-
-	t.Run("soon", func(t *testing.T) {
-		v := now.Add(800 * time.Millisecond)
-		assertEqual(t, HumanTime(v, ""), "Less than a second from now")
-	})
+}
+
+func TestExactDuration(t *testing.T) {
+	assertEqual(t, "1 millisecond", ExactDuration(1*time.Millisecond))
+	assertEqual(t, "10 milliseconds", ExactDuration(10*time.Millisecond))
+	assertEqual(t, "1 second", ExactDuration(1*time.Second))
+	assertEqual(t, "10 seconds", ExactDuration(10*time.Second))
+	assertEqual(t, "1 minute", ExactDuration(1*time.Minute))
+	assertEqual(t, "10 minutes", ExactDuration(10*time.Minute))
+	assertEqual(t, "1 hour", ExactDuration(1*time.Hour))
+	assertEqual(t, "10 hours", ExactDuration(10*time.Hour))
+	assertEqual(t, "1 hour 1 second", ExactDuration(1*time.Hour+1*time.Second))
+	assertEqual(t, "1 hour 10 seconds", ExactDuration(1*time.Hour+10*time.Second))
+	assertEqual(t, "1 hour 1 minute", ExactDuration(1*time.Hour+1*time.Minute))
+	assertEqual(t, "1 hour 10 minutes", ExactDuration(1*time.Hour+10*time.Minute))
+	assertEqual(t, "1 hour 1 minute 1 second", ExactDuration(1*time.Hour+1*time.Minute+1*time.Second))
+	assertEqual(t, "10 hours 10 minutes 10 seconds", ExactDuration(10*time.Hour+10*time.Minute+10*time.Second))
 }
--- a/go.mod
+++ b/go.mod
@@ -10,7 +10,6 @@ require (
 	github.com/olekukonko/tablewriter v0.0.5
 	github.com/pdevine/readline v1.5.2
 	github.com/spf13/cobra v1.7.0
-	golang.org/x/sync v0.3.0
 )

 require github.com/rivo/uniseg v0.2.0 // indirect
@@ -45,6 +44,7 @@ require (
 	golang.org/x/sys v0.11.0 // indirect
 	golang.org/x/term v0.10.0
 	golang.org/x/text v0.10.0 // indirect
+	gonum.org/v1/gonum v0.13.0
 	google.golang.org/protobuf v1.30.0 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
 )
--- a/go.sum
+++ b/go.sum
@@ -125,8 +125,6 @@ golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63/go.mod h1:0v4NqG35kSWCMzLaMe
 golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=
 golang.org/x/net v0.10.0 h1:X2//UzNDwYmtCLn7To6G58Wr6f5ahEAQgKNzv9Y951M=
 golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
-golang.org/x/sync v0.3.0 h1:ftCYgMx6zT/asHUrPw8BLLscYtGznsLAnjq5RH9P66E=
-golang.org/x/sync v0.3.0/go.mod h1:FU7BRWz2tNW+3quACPkgCx/L+uEAv1htQ0V83Z9Rj+Y=
 golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
 golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
@@ -145,6 +143,8 @@ golang.org/x/text v0.10.0 h1:UpjohKhiEgNc0CSauXmwYftY1+LlaC75SJwh0SgCX58=
 golang.org/x/text v0.10.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
 golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
 golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
+gonum.org/v1/gonum v0.13.0 h1:a0T3bh+7fhRyqeNbiC3qVHYmkiQgit3wnNan/2c0HMM=
+gonum.org/v1/gonum v0.13.0/go.mod h1:/WPYRckkfWrhWefxyYTfrTtQR0KH4iyHNuzxqXAKyAU=
 google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
 google.golang.org/protobuf v1.28.0/go.mod h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I=
 google.golang.org/protobuf v1.30.0 h1:kPPoIgf3TsEvrm0PFe15JQ+570QVxYzEvvHqChK+cng=
--- a/llm/falcon.go
+++ b/llm/falcon.go
@@ -1,5 +1,7 @@
 package llm

+const ModelFamilyFalcon = "falcon"
+
 const (
 	falconModelType7B   = 32
 	falconModelType40B  = 60
@@ -15,6 +17,6 @@ func falconModelType(numLayer uint32) string {
 	case 80:
 		return "180B"
 	default:
-		return "unknown"
+		return "Unknown"
 	}
 }
--- a/llm/ggml.go
+++ b/llm/ggml.go
@@ -69,7 +69,7 @@ func fileType(fileType uint32) string {
 	case fileTypeQ6_K:
 		return "Q6_K"
 	default:
-		return "unknown"
+		return "Unknown"
 	}
 }

--- a/llm/gguf.go
+++ b/llm/gguf.go
@@ -109,13 +109,9 @@ func (llm *ggufModel) ModelType() string {
 		if blocks, ok := llm.kv["falcon.block_count"].(uint32); ok {
 			return falconModelType(blocks)
 		}
-	case "starcoder":
-		if blocks, ok := llm.kv["starcoder.block_count"].(uint32); ok {
-			return starCoderModelType(blocks)
-		}
 	}

-	return "unknown"
+	return "Unknown"
 }

 func (llm *ggufModel) FileType() string {
@@ -124,7 +120,7 @@ func (llm *ggufModel) FileType() string {
 		return fileType(t)
 	}

-	return "unknown"
+	return "Unknown"
 }

 func (llm *ggufModel) Decode(r io.Reader) error {
--- a/llm/llama.cpp/generate_darwin_amd64.go
+++ b/llm/llama.cpp/generate_darwin_amd64.go
@@ -9,10 +9,8 @@ package llm
 //go:generate git -C ggml apply ../patches/0004-metal-add-missing-barriers-for-mul-mat-2699.patch
 //go:generate cmake -S ggml -B ggml/build/cpu -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
 //go:generate cmake --build ggml/build/cpu --target server --config Release
-//go:generate mv ggml/build/cpu/bin/server ggml/build/cpu/bin/ollama-runner

 //go:generate git submodule update --force gguf
 //go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
 //go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
 //go:generate cmake --build gguf/build/cpu --target server --config Release
-//go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner
--- a/llm/llama.cpp/generate_darwin_arm64.go
+++ b/llm/llama.cpp/generate_darwin_arm64.go
@@ -9,10 +9,8 @@ package llm
 //go:generate git -C ggml apply ../patches/0004-metal-add-missing-barriers-for-mul-mat-2699.patch
 //go:generate cmake -S ggml -B ggml/build/metal -DLLAMA_METAL=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
 //go:generate cmake --build ggml/build/metal --target server --config Release
-//go:generate mv ggml/build/metal/bin/server ggml/build/metal/bin/ollama-runner

 //go:generate git submodule update --force gguf
 //go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
 //go:generate cmake -S gguf -B gguf/build/metal -DLLAMA_METAL=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
 //go:generate cmake --build gguf/build/metal --target server --config Release
-//go:generate mv gguf/build/metal/bin/server gguf/build/metal/bin/ollama-runner
--- a/llm/llama.cpp/generate_linux.go
+++ b/llm/llama.cpp/generate_linux.go
@@ -9,18 +9,14 @@ package llm
 //go:generate git -C ggml apply ../patches/0001-copy-cuda-runtime-libraries.patch
 //go:generate cmake -S ggml -B ggml/build/cpu -DLLAMA_K_QUANTS=on
 //go:generate cmake --build ggml/build/cpu --target server --config Release
-//go:generate mv ggml/build/cpu/bin/server ggml/build/cpu/bin/ollama-runner

 //go:generate git submodule update --force gguf
 //go:generate git -C gguf apply ../patches/0001-copy-cuda-runtime-libraries.patch
 //go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
 //go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on
 //go:generate cmake --build gguf/build/cpu --target server --config Release
-//go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner

 //go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
 //go:generate cmake --build ggml/build/cuda --target server --config Release
-//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
 //go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
 //go:generate cmake --build gguf/build/cuda --target server --config Release
-//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner
--- a/llm/llama.cpp/generate_windows.go
+++ b/llm/llama.cpp/generate_windows.go
@@ -7,10 +7,8 @@ package llm
 //go:generate git -C ggml apply ../patches/0002-34B-model-support.patch
 //go:generate cmake -S ggml -B ggml/build/cpu -DLLAMA_K_QUANTS=on
 //go:generate cmake --build ggml/build/cpu --target server --config Release
-//go:generate cmd /c move ggml\build\cpu\bin\Release\server.exe ggml\build\cpu\bin\Release\ollama-runner.exe

 //go:generate git submodule update --force gguf
 //go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
 //go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on
 //go:generate cmake --build gguf/build/cpu --target server --config Release
-//go:generate cmd /c move gguf\build\cpu\bin\Release\server.exe gguf\build\cpu\bin\Release\ollama-runner.exe
--- a/llm/llama.cpp/gguf
+++ b/llm/llama.cpp/gguf
--- a/llm/llama.cpp/patches/0001-remove-warm-up-logging.patch
+++ b/llm/llama.cpp/patches/0001-remove-warm-up-logging.patch
@@ -1,6 +1,6 @@
-From 8dbb5449db259a9c24796e7927d89bee98b6c8f5 Mon Sep 17 00:00:00 2001
-From: Bruce MacDonald <brucewmacdonald@gmail.com>
-Date: Thu, 5 Oct 2023 11:21:12 -0400
+From 07993bdc35345b67b27aa649a7c099ad42d80c4c Mon Sep 17 00:00:00 2001
+From: Michael Yang <mxyng@pm.me>
+Date: Thu, 21 Sep 2023 14:43:21 -0700
 Subject: [PATCH] remove warm up logging

 ---
@@ -8,18 +8,18 @@ Subject: [PATCH] remove warm up logging
 1 file changed, 2 deletions(-)

 diff --git a/common/common.cpp b/common/common.cpp
-index 7370017..c4433fe 100644
+index 2597ba0..b56549b 100644
 --- a/common/common.cpp
 +++ b/common/common.cpp
-@@ -839,8 +839,6 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
+@@ -780,8 +780,6 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
     }
 
     {
 -        LOG("warming up the model with an empty run\n");
 -
-         std::vector<llama_token> tmp = { llama_token_bos(lctx), llama_token_eos(lctx), };
-         llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
-         llama_kv_cache_tokens_rm(lctx, -1, -1);
+         const std::vector<llama_token> tmp = { llama_token_bos(lctx), llama_token_eos(lctx), };
+         llama_eval(lctx, tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, params.n_threads);
+         llama_reset_timings(lctx);
 -- 
-2.39.2 (Apple Git-143)
+2.42.0

--- a/llm/llama.go
+++ b/llm/llama.go
@@ -20,57 +20,54 @@ import (
 	"runtime"
 	"strconv"
 	"strings"
-	"sync"
 	"time"

 	"github.com/jmorganca/ollama/api"
-	"github.com/jmorganca/ollama/format"
 )

 //go:embed llama.cpp/*/build/*/bin/*
 var llamaCppEmbed embed.FS

 type ModelRunner struct {
-	Path        string // path to the model runner executable
-	Accelerated bool
+	Path string // path to the model runner executable
 }

 func chooseRunners(workDir, runnerType string) []ModelRunner {
 	buildPath := path.Join("llama.cpp", runnerType, "build")
-	var runners []ModelRunner
+	var runners []string

 	// set the runners based on the OS
 	// IMPORTANT: the order of the runners in the array is the priority order
 	switch runtime.GOOS {
 	case "darwin":
-		runners = []ModelRunner{
-			{Path: path.Join(buildPath, "metal", "bin", "ollama-runner")},
-			{Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
+		runners = []string{
+			path.Join(buildPath, "metal", "bin", "server"),
+			path.Join(buildPath, "cpu", "bin", "server"),
 		}
 	case "linux":
-		runners = []ModelRunner{
-			{Path: path.Join(buildPath, "cuda", "bin", "ollama-runner"), Accelerated: true},
-			{Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
+		runners = []string{
+			path.Join(buildPath, "cuda", "bin", "server"),
+			path.Join(buildPath, "cpu", "bin", "server"),
 		}
 	case "windows":
 		// TODO: select windows GPU runner here when available
-		runners = []ModelRunner{
-			{Path: path.Join(buildPath, "cpu", "bin", "Release", "ollama-runner.exe")},
+		runners = []string{
+			path.Join(buildPath, "cpu", "bin", "Release", "server.exe"),
 		}
 	default:
 		log.Printf("unknown OS, running on CPU: %s", runtime.GOOS)
-		runners = []ModelRunner{
-			{Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
+		runners = []string{
+			path.Join(buildPath, "cpu", "bin", "server"),
 		}
 	}

 	runnerAvailable := false // if no runner files are found in the embed, this flag will cause a fast fail
 	for _, r := range runners {
 		// find all the files in the runner's bin directory
-		files, err := fs.Glob(llamaCppEmbed, path.Join(path.Dir(r.Path), "*"))
+		files, err := fs.Glob(llamaCppEmbed, path.Join(path.Dir(r), "*"))
 		if err != nil {
 			// this is expected, ollama may be compiled without all runners packed in
-			log.Printf("%s runner not found: %v", r.Path, err)
+			log.Printf("%s runner not found: %v", r, err)
 			continue
 		}

@@ -117,10 +114,7 @@ func chooseRunners(workDir, runnerType string) []ModelRunner {
 	localRunnersByPriority := []ModelRunner{}
 	for _, r := range runners {
 		// clean the ModelRunner paths so that they match the OS we are running on
-		localRunnersByPriority = append(localRunnersByPriority, ModelRunner{
-			Path:        filepath.Clean(path.Join(workDir, r.Path)),
-			Accelerated: r.Accelerated,
-		})
+		localRunnersByPriority = append(localRunnersByPriority, ModelRunner{Path: filepath.Clean(path.Join(workDir, r))})
 	}

 	return localRunnersByPriority
@@ -149,7 +143,7 @@ func llamaModelType(numLayer uint32) string {
 	case 80:
 		return "65B"
 	default:
-		return "unknown"
+		return "Unknown"
 	}
 }

@@ -183,12 +177,9 @@ type llamaHyperparameters struct {
 }

 type Running struct {
-	Port          int
-	Cmd           *exec.Cmd
-	Cancel        context.CancelFunc
-	exitOnce      sync.Once
-	exitCh        chan error // channel to receive the exit status of the subprocess
-	*StatusWriter            // captures error messages from the llama runner process
+	Port   int
+	Cmd    *exec.Cmd
+	Cancel context.CancelFunc
 }

 type llama struct {
@@ -198,9 +189,9 @@ type llama struct {

 var errNoGPU = errors.New("nvidia-smi command failed")

-// CheckVRAM returns the free VRAM in bytes on Linux machines with NVIDIA GPUs
+// CheckVRAM returns the total VRAM in MiB on Linux machines with NVIDIA GPUs
 func CheckVRAM() (int64, error) {
-	cmd := exec.Command("nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits")
+	cmd := exec.Command("nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits")
 	var stdout bytes.Buffer
 	cmd.Stdout = &stdout
 	err := cmd.Run()
@@ -208,7 +199,7 @@ func CheckVRAM() (int64, error) {
 		return 0, errNoGPU
 	}

-	var freeMiB int64
+	var total int64
 	scanner := bufio.NewScanner(&stdout)
 	for scanner.Scan() {
 		line := scanner.Text()
@@ -217,24 +208,19 @@ func CheckVRAM() (int64, error) {
 			return 0, fmt.Errorf("failed to parse available VRAM: %v", err)
 		}

-		freeMiB += vram
+		total += vram
 	}

-	freeBytes := freeMiB * 1024 * 1024
-	if freeBytes < 2*format.GigaByte {
-		log.Printf("less than 2 GB VRAM available, falling back to CPU only")
-		freeMiB = 0
-	}
-
-	return freeBytes, nil
+	return total, nil
 }

 func NumGPU(numLayer, fileSizeBytes int64, opts api.Options) int {
 	if opts.NumGPU != -1 {
 		return opts.NumGPU
 	}
+	n := 1 // default to enable metal on macOS
 	if runtime.GOOS == "linux" {
-		freeBytes, err := CheckVRAM()
+		vramMib, err := CheckVRAM()
 		if err != nil {
 			if err.Error() != "nvidia-smi command failed" {
 				log.Print(err.Error())
@@ -243,48 +229,21 @@ func NumGPU(numLayer, fileSizeBytes int64, opts api.Options) int {
 			return 0
 		}

+		totalVramBytes := int64(vramMib) * 1024 * 1024 // 1 MiB = 1024^2 bytes
+
 		// Calculate bytes per layer
 		// TODO: this is a rough heuristic, better would be to calculate this based on number of layers and context size
 		bytesPerLayer := fileSizeBytes / numLayer

-		// max number of layers we can fit in VRAM, subtract 8% to prevent consuming all available VRAM and running out of memory
-		layers := int(freeBytes/bytesPerLayer) * 92 / 100
-		log.Printf("%d MB VRAM available, loading up to %d GPU layers", freeBytes/(1024*1024), layers)
+		n = int(totalVramBytes / bytesPerLayer) // max number of layers we can fit in VRAM

-		return layers
+		log.Printf("%d MiB VRAM available, loading up to %d GPU layers", vramMib, n)
+		return n
 	}
 	// default to enable metal on macOS
 	return 1
 }

-// StatusWriter is a writer that captures error messages from the llama runner process
-type StatusWriter struct {
-	ErrCh      chan error
-	LastErrMsg string
-}
-
-func NewStatusWriter() *StatusWriter {
-	return &StatusWriter{
-		ErrCh: make(chan error, 1),
-	}
-}
-
-func (w *StatusWriter) Write(b []byte) (int, error) {
-	var errMsg string
-	if _, after, ok := bytes.Cut(b, []byte("error:")); ok {
-		errMsg = string(bytes.TrimSpace(after))
-	} else if _, after, ok := bytes.Cut(b, []byte("CUDA error")); ok {
-		errMsg = string(bytes.TrimSpace(after))
-	}
-
-	if errMsg != "" {
-		w.LastErrMsg = errMsg
-		w.ErrCh <- fmt.Errorf("llama runner: %s", errMsg)
-	}
-
-	return os.Stderr.Write(b)
-}
-
 func newLlama(model string, adapters []string, runners []ModelRunner, numLayers int64, opts api.Options) (*llama, error) {
 	fileInfo, err := os.Stat(model)
 	if err != nil {
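The layer estimate in NumGPU in the hunk above is a plain division: model file size over layer count gives bytes per layer, and total VRAM over bytes per layer gives the layer count. A minimal stand-alone sketch of that arithmetic follows. `estimateGPULayers` is a hypothetical name, and the zero guards are an addition for illustration, not part of this change.

```go
package main

import "fmt"

// estimateGPULayers is a hypothetical stand-alone version of the heuristic:
// it treats every layer as the same size and ignores context-size overhead.
func estimateGPULayers(vramMiB, fileSizeBytes, numLayer int64) int {
	// Guards added for this sketch to avoid dividing by zero.
	if numLayer <= 0 || fileSizeBytes <= 0 {
		return 0
	}
	bytesPerLayer := fileSizeBytes / numLayer
	if bytesPerLayer == 0 {
		return 0
	}

	vramBytes := vramMiB * 1024 * 1024 // 1 MiB = 1024^2 bytes
	return int(vramBytes / bytesPerLayer)
}

func main() {
	// 8192 MiB of VRAM, a 4 GiB model file with 32 layers:
	// 128 MiB per layer, so 64 layers fit by this estimate.
	fmt.Println(estimateGPULayers(8192, 4<<30, 32))
}
```

The result can exceed the model's real layer count, as in the example. Neither the sketch nor the code above clamps it.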
@@ -295,14 +254,13 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 		return nil, errors.New("ollama supports only one lora adapter, but multiple were provided")
 	}

-	numGPU := NumGPU(numLayers, fileInfo.Size(), opts)
 	params := []string{
 		"--model", model,
 		"--ctx-size", fmt.Sprintf("%d", opts.NumCtx),
 		"--rope-freq-base", fmt.Sprintf("%f", opts.RopeFrequencyBase),
 		"--rope-freq-scale", fmt.Sprintf("%f", opts.RopeFrequencyScale),
 		"--batch-size", fmt.Sprintf("%d", opts.NumBatch),
-		"--n-gpu-layers", fmt.Sprintf("%d", numGPU),
+		"--n-gpu-layers", fmt.Sprintf("%d", NumGPU(numLayers, fileInfo.Size(), opts)),
 		"--embedding",
 	}

@@ -332,15 +290,8 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 		params = append(params, "--numa")
 	}

-	var runnerErr error
-
 	// start the llama.cpp server with a retry in case the port is already in use
 	for _, runner := range runners {
-		if runner.Accelerated && numGPU == 0 {
-			log.Printf("skipping accelerated runner because num_gpu=0")
-			continue
-		}
-
 		if _, err := os.Stat(runner.Path); err != nil {
 			log.Printf("llama runner not found: %v", err)
 			continue
@@ -355,10 +306,9 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 		)
 		cmd.Env = append(os.Environ(), fmt.Sprintf("LD_LIBRARY_PATH=%s", filepath.Dir(runner.Path)))
 		cmd.Stdout = os.Stderr
-		statusWriter := NewStatusWriter()
-		cmd.Stderr = statusWriter
+		cmd.Stderr = os.Stderr

-		llm := &llama{Options: opts, Running: Running{Port: port, Cmd: cmd, Cancel: cancel, exitCh: make(chan error)}}
+		llm := &llama{Options: opts, Running: Running{Port: port, Cmd: cmd, Cancel: cancel}}

 		log.Print("starting llama runner")
 		if err := llm.Cmd.Start(); err != nil {
@@ -366,36 +316,19 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 			continue
 		}

-		// monitor the llama runner process and signal when it exits
+		// Wait blocks until the runner exits, so monitor it in a goroutine and log how it exited
 		go func() {
-			err := llm.Cmd.Wait()
-			// default to printing the exit message of the command process, it will probably just say 'exit staus 1'
-			errMsg := err.Error()
-			// try to set a better error message if llama runner logs captured an error
-			if statusWriter.LastErrMsg != "" {
-				errMsg = statusWriter.LastErrMsg
+			err := llm.Cmd.Wait() // this will block until the command exits
+			if err != nil {
+				log.Printf("llama runner exited with error: %v", err)
+			} else {
+				log.Printf("llama runner exited")
 			}
-			log.Println(errMsg)
-			// llm.Cmd.Wait() can only be called once, use this exit channel to signal that the process has exited
-			llm.exitOnce.Do(func() {
-				close(llm.exitCh)
-			})
 		}()

 		if err := waitForServer(llm); err != nil {
 			log.Printf("error starting llama runner: %v", err)
 			llm.Close()
-
-			// default the runnerErr to the error returned by the most recent llama runner process
-			runnerErr = err
-
-			// capture the error directly from the runner process, if any
-			select {
-			case runnerErr = <-statusWriter.ErrCh:
-			default:
-				// the runner process probably timed out
-			}
-
 			// try again
 			continue
 		}
@@ -404,74 +337,109 @@ func newLlama(model string, adapters []string, runners []ModelRunner, numLayers
 		return llm, nil
 	}

-	if runnerErr != nil {
-		// this is the error returned from the llama runner process that failed most recently
-		return nil, runnerErr
-	}
-
 	return nil, fmt.Errorf("failed to start a llama runner")
 }

 func waitForServer(llm *llama) error {
+	// wait for the server to start responding
 	start := time.Now()
-	expiresAt := time.Now().Add(3 * time.Minute) // be generous with timeout, large models can take a while to load
+	expiresAt := time.Now().Add(2 * time.Minute) // be generous with timeout, large models can take a while to load
 	ticker := time.NewTicker(200 * time.Millisecond)
-	defer ticker.Stop()

 	log.Print("waiting for llama runner to start responding")
-	for {
-		select {
-		case <-llm.exitCh:
-			// failed to start subprocess
-			return fmt.Errorf("llama runner process has terminated")
-		case <-ticker.C:
-			if time.Now().After(expiresAt) {
-				// timeout
-				return fmt.Errorf("timed out waiting for llama runner to start")
-			}
+	for range ticker.C {
+		if time.Now().After(expiresAt) {
+			return fmt.Errorf("llama runner did not start within allotted time, retrying")
+		}

-			if err := llm.Ping(context.Background()); err == nil {
-				// success
-				log.Printf("llama runner started in %f seconds", time.Since(start).Seconds())
-				return nil
-			}
+		// check if the server process has terminated
+		if llm.Cmd.ProcessState != nil && llm.Cmd.ProcessState.Exited() {
+			return fmt.Errorf("llama runner process has terminated")
+		}
+
+		if err := llm.Ping(context.Background()); err == nil {
+			break
 		}
 	}
+
+	log.Printf("llama runner started in %f seconds", time.Since(start).Seconds())
+	return nil
 }

 func (llm *llama) Close() {
-	// signal the sub-process to terminate
 	llm.Cancel()
-
-	// wait for the command to exit to prevent race conditions with the next run
-	<-llm.exitCh
-
-	if llm.StatusWriter != nil && llm.StatusWriter.LastErrMsg != "" {
-		log.Printf("llama runner stopped with error: %v", llm.StatusWriter.LastErrMsg)
-	} else {
-		log.Print("llama runner stopped successfully")
-	}
 }

 func (llm *llama) SetOptions(opts api.Options) {
 	llm.Options = opts
 }

-type prediction struct {
+type GenerationSettings struct {
+	FrequencyPenalty float64       `json:"frequency_penalty"`
+	IgnoreEOS        bool          `json:"ignore_eos"`
+	LogitBias        []interface{} `json:"logit_bias"`
+	Mirostat         int           `json:"mirostat"`
+	MirostatEta      float64       `json:"mirostat_eta"`
+	MirostatTau      float64       `json:"mirostat_tau"`
+	Model            string        `json:"model"`
+	NCtx             int           `json:"n_ctx"`
+	NKeep            int           `json:"n_keep"`
+	NPredict         int           `json:"n_predict"`
+	NProbs           int           `json:"n_probs"`
+	PenalizeNl       bool          `json:"penalize_nl"`
+	PresencePenalty  float64       `json:"presence_penalty"`
+	RepeatLastN      int           `json:"repeat_last_n"`
+	RepeatPenalty    float64       `json:"repeat_penalty"`
+	Seed             uint32        `json:"seed"`
+	Stop             []string      `json:"stop"`
+	Stream           bool          `json:"stream"`
+	Temp             float64       `json:"temp"`
+	TfsZ             float64       `json:"tfs_z"`
+	TopK             int           `json:"top_k"`
+	TopP             float64       `json:"top_p"`
+	TypicalP         float64       `json:"typical_p"`
+}
+
+type Timings struct {
+	PredictedN  int     `json:"predicted_n"`
+	PredictedMS float64 `json:"predicted_ms"`
+	PromptN     int     `json:"prompt_n"`
+	PromptMS    float64 `json:"prompt_ms"`
+}
+
+type Prediction struct {
 	Content string `json:"content"`
 	Model   string `json:"model"`
 	Prompt  string `json:"prompt"`
 	Stop    bool   `json:"stop"`

-	Timings struct {
-		PredictedN  int     `json:"predicted_n"`
-		PredictedMS float64 `json:"predicted_ms"`
-		PromptN     int     `json:"prompt_n"`
-		PromptMS    float64 `json:"prompt_ms"`
-	}
+	Timings `json:"timings"`
 }

-const maxBufferSize = 512 * format.KiloByte
+type PredictRequest struct {
+	Stream           bool            `json:"stream"`
+	NPredict         int             `json:"n_predict,omitempty"`
+	TopK             int             `json:"top_k,omitempty"`
+	TopP             float32         `json:"top_p,omitempty"`
+	TfsZ             float32         `json:"tfs_z,omitempty"`
+	TypicalP         float32         `json:"typical_p,omitempty"`
+	RepeatLastN      int             `json:"repeat_last_n,omitempty"`
+	Temperature      float32         `json:"temperature,omitempty"`
+	RepeatPenalty    float32         `json:"repeat_penalty,omitempty"`
+	PresencePenalty  float32         `json:"presence_penalty,omitempty"`
+	FrequencyPenalty float32         `json:"frequency_penalty,omitempty"`
+	Mirostat         int             `json:"mirostat,omitempty"`
+	MirostatTau      float32         `json:"mirostat_tau,omitempty"`
+	MirostatEta      float32         `json:"mirostat_eta,omitempty"`
+	PenalizeNl       bool            `json:"penalize_nl,omitempty"`
+	NKeep            int             `json:"n_keep,omitempty"`
+	Seed             int             `json:"seed,omitempty"`
+	Prompt           string          `json:"prompt,omitempty"`
+	NProbs           int             `json:"n_probs,omitempty"`
+	LogitBias        map[int]float32 `json:"logit_bias,omitempty"`
+	IgnoreEos        bool            `json:"ignore_eos,omitempty"`
+	Stop             []string        `json:"stop,omitempty"`
+}

 func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string, fn func(api.GenerateResponse)) error {
 	prevConvo, err := llm.Decode(ctx, prevContext)
@@ -479,46 +447,37 @@ func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string,
 		return err
 	}

-	// Remove leading spaces from prevConvo if present
-	prevConvo = strings.TrimPrefix(prevConvo, " ")
-
 	var nextContext strings.Builder
 	nextContext.WriteString(prevConvo)
 	nextContext.WriteString(prompt)

-	request := map[string]any{
-		"prompt":            nextContext.String(),
-		"stream":            true,
-		"n_predict":         llm.NumPredict,
-		"n_keep":            llm.NumKeep,
-		"temperature":       llm.Temperature,
-		"top_k":             llm.TopK,
-		"top_p":             llm.TopP,
-		"tfs_z":             llm.TFSZ,
-		"typical_p":         llm.TypicalP,
-		"repeat_last_n":     llm.RepeatLastN,
-		"repeat_penalty":    llm.RepeatPenalty,
-		"presence_penalty":  llm.PresencePenalty,
-		"frequency_penalty": llm.FrequencyPenalty,
-		"mirostat":          llm.Mirostat,
-		"mirostat_tau":      llm.MirostatTau,
-		"mirostat_eta":      llm.MirostatEta,
-		"penalize_nl":       llm.PenalizeNewline,
-		"seed":              llm.Seed,
-		"stop":              llm.Stop,
-	}
-
-	// Handling JSON marshaling with special characters unescaped.
-	buffer := &bytes.Buffer{}
-	enc := json.NewEncoder(buffer)
-	enc.SetEscapeHTML(false)
-
-	if err := enc.Encode(request); err != nil {
-		return fmt.Errorf("failed to marshal data: %v", err)
-	}
-
 	endpoint := fmt.Sprintf("http://127.0.0.1:%d/completion", llm.Port)
-	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, buffer)
+	predReq := PredictRequest{
+		Prompt:           nextContext.String(),
+		Stream:           true,
+		NPredict:         llm.NumPredict,
+		NKeep:            llm.NumKeep,
+		Temperature:      llm.Temperature,
+		TopK:             llm.TopK,
+		TopP:             llm.TopP,
+		TfsZ:             llm.TFSZ,
+		TypicalP:         llm.TypicalP,
+		RepeatLastN:      llm.RepeatLastN,
+		RepeatPenalty:    llm.RepeatPenalty,
+		PresencePenalty:  llm.PresencePenalty,
+		FrequencyPenalty: llm.FrequencyPenalty,
+		Mirostat:         llm.Mirostat,
+		MirostatTau:      llm.MirostatTau,
+		MirostatEta:      llm.MirostatEta,
+		PenalizeNl:       llm.PenalizeNewline,
+		Stop:             llm.Stop,
+	}
+	data, err := json.Marshal(predReq)
+	if err != nil {
+		return fmt.Errorf("error marshaling data: %v", err)
+	}
+
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewBuffer(data))
 	if err != nil {
 		return fmt.Errorf("error creating POST request: %v", err)
 	}
@@ -540,23 +499,22 @@ func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string,
 	}

 	scanner := bufio.NewScanner(resp.Body)
-	// increase the buffer size to avoid running out of space
-	buf := make([]byte, 0, maxBufferSize)
-	scanner.Buffer(buf, maxBufferSize)
 	for scanner.Scan() {
 		select {
 		case <-ctx.Done():
 			// This handles the request cancellation
 			return ctx.Err()
 		default:
-			line := scanner.Bytes()
-			if len(line) == 0 {
+			line := scanner.Text()
+			if line == "" {
 				continue
 			}

-			if evt, ok := bytes.CutPrefix(line, []byte("data: ")); ok {
-				var p prediction
-				if err := json.Unmarshal(evt, &p); err != nil {
+			// Read data from the server-sent event stream
+			if strings.HasPrefix(line, "data: ") {
+				evt := line[6:]
+				var p Prediction
+				if err := json.Unmarshal([]byte(evt), &p); err != nil {
 					return fmt.Errorf("error unmarshaling llm prediction response: %v", err)
 				}

@@ -574,10 +532,10 @@ func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string,
 					fn(api.GenerateResponse{
 						Done:               true,
 						Context:            embd,
-						PromptEvalCount:    p.Timings.PromptN,
-						PromptEvalDuration: parseDurationMs(p.Timings.PromptMS),
-						EvalCount:          p.Timings.PredictedN,
-						EvalDuration:       parseDurationMs(p.Timings.PredictedMS),
+						PromptEvalCount:    p.PromptN,
+						PromptEvalDuration: parseDurationMs(p.PromptMS),
+						EvalCount:          p.PredictedN,
+						EvalDuration:       parseDurationMs(p.PredictedMS),
 					})

 					return nil
@@ -587,14 +545,6 @@ func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string,
 	}

 	if err := scanner.Err(); err != nil {
-		if strings.Contains(err.Error(), "unexpected EOF") {
-			// this means the llama runner subprocess crashed
-			llm.Close()
-			if llm.StatusWriter != nil && llm.StatusWriter.LastErrMsg != "" {
-				return fmt.Errorf("llama runner exited: %v", llm.StatusWriter.LastErrMsg)
-			}
-			return fmt.Errorf("llama runner exited, you may not have enough available memory to run this model")
-		}
 		return fmt.Errorf("error reading llm response: %v", err)
 	}

@@ -691,6 +641,9 @@ func (llm *llama) Decode(ctx context.Context, tokens []int) (string, error) {
 		return "", fmt.Errorf("unmarshal encode response: %w", err)
 	}

+	// decoded content contains a leading space; strip it
+	decoded.Content, _ = strings.CutPrefix(decoded.Content, " ")
+
 	return decoded.Content, nil
 }

--- a/llm/llm.go
+++ b/llm/llm.go
@@ -5,12 +5,10 @@ import (
 	"fmt"
 	"log"
 	"os"
-	"runtime"

 	"github.com/pbnjay/memory"

 	"github.com/jmorganca/ollama/api"
-	"github.com/jmorganca/ollama/format"
 )

 type LLM interface {
@@ -39,47 +37,54 @@ func New(workDir, model string, adapters []string, opts api.Options) (LLM, error
 		return nil, err
 	}

-	if runtime.GOOS == "darwin" {
-		switch ggml.FileType() {
-		case "Q8_0":
-			if ggml.Name() != "gguf" && opts.NumGPU != 0 {
-				// GGML Q8_0 do not support Metal API and will
-				// cause the runner to segmentation fault so disable GPU
-				log.Printf("WARNING: GPU disabled for F32, Q5_0, Q5_1, and Q8_0")
-				opts.NumGPU = 0
-			}
-		case "F32", "Q5_0", "Q5_1":
-			if opts.NumGPU != 0 {
-				// F32, Q5_0, Q5_1, and Q8_0 do not support Metal API and will
-				// cause the runner to segmentation fault so disable GPU
-				log.Printf("WARNING: GPU disabled for F32, Q5_0, Q5_1, and Q8_0")
-				opts.NumGPU = 0
-			}
+	switch ggml.FileType() {
+	case "Q8_0":
+		if ggml.Name() != "gguf" && opts.NumGPU != 0 {
+			// GGML Q8_0 does not support the Metal API and will
+			// cause the runner to segfault, so disable the GPU
+			log.Printf("WARNING: GPU disabled for F32, Q5_0, Q5_1, and Q8_0")
+			opts.NumGPU = 0
 		}
-
-		var requiredMemory int64
-		var f16Multiplier int64 = 2
-
-		switch ggml.ModelType() {
-		case "3B", "7B":
-			requiredMemory = 8 * format.GigaByte
-		case "13B":
-			requiredMemory = 16 * format.GigaByte
-		case "30B", "34B", "40B":
-			requiredMemory = 32 * format.GigaByte
-		case "65B", "70B":
-			requiredMemory = 64 * format.GigaByte
-		case "180B":
-			requiredMemory = 128 * format.GigaByte
-			f16Multiplier = 4
+	case "F32", "Q5_0", "Q5_1":
+		if opts.NumGPU != 0 {
+			// F32, Q5_0, Q5_1, and Q8_0 do not support Metal API and will
+			// cause the runner to segmentation fault so disable GPU
+			log.Printf("WARNING: GPU disabled for F32, Q5_0, Q5_1, and Q8_0")
+			opts.NumGPU = 0
 		}
+	}

-		systemMemory := int64(memory.TotalMemory())
-
-		if ggml.FileType() == "F16" && requiredMemory*f16Multiplier > systemMemory {
-			return nil, fmt.Errorf("F16 model requires at least %s of total memory", format.HumanBytes(requiredMemory))
-		} else if requiredMemory > systemMemory {
-			return nil, fmt.Errorf("model requires at least %s of total memory", format.HumanBytes(requiredMemory))
+	totalResidentMemory := memory.TotalMemory()
+	switch ggml.ModelType() {
+	case "3B", "7B":
+		if ggml.FileType() == "F16" && totalResidentMemory < 16*1024*1024*1024 {
+			return nil, fmt.Errorf("F16 model requires at least 16GB of memory")
+		} else if totalResidentMemory < 8*1024*1024*1024 {
+			return nil, fmt.Errorf("model requires at least 8GB of memory")
+		}
+	case "13B":
+		if ggml.FileType() == "F16" && totalResidentMemory < 32*1024*1024*1024 {
+			return nil, fmt.Errorf("F16 model requires at least 32GB of memory")
+		} else if totalResidentMemory < 16*1024*1024*1024 {
+			return nil, fmt.Errorf("model requires at least 16GB of memory")
+		}
+	case "30B", "34B", "40B":
+		if ggml.FileType() == "F16" && totalResidentMemory < 64*1024*1024*1024 {
+			return nil, fmt.Errorf("F16 model requires at least 64GB of memory")
+		} else if totalResidentMemory < 32*1024*1024*1024 {
+			return nil, fmt.Errorf("model requires at least 32GB of memory")
+		}
+	case "65B", "70B":
+		if ggml.FileType() == "F16" && totalResidentMemory < 128*1024*1024*1024 {
+			return nil, fmt.Errorf("F16 model requires at least 128GB of memory")
+		} else if totalResidentMemory < 64*1024*1024*1024 {
+			return nil, fmt.Errorf("model requires at least 64GB of memory")
+		}
+	case "180B":
+		if ggml.FileType() == "F16" && totalResidentMemory < 512*1024*1024*1024 {
+			return nil, fmt.Errorf("F16 model requires at least 512GB of memory")
+		} else if totalResidentMemory < 128*1024*1024*1024 {
+			return nil, fmt.Errorf("model requires at least 128GB of memory")
 		}
 	}

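`memory.TotalMemory()` from `github.com/pbnjay/memory` reports a byte count, so the per-model thresholds in this hunk are easiest to keep correct when they are built from a named byte constant rather than repeated literals. The sketch below shows one table-driven way to express the same sizes (8, 16, 32, 64 and 128 GB, doubled for F16 and quadrupled for 180B F16). The names `gigabyte`, `requiredMemory` and `checkMemory` are assumptions made for this example and do not appear in the commit.

```go
package main

import "fmt"

// gigabyte is the number of bytes in one GiB (assumed unit for the checks).
const gigabyte uint64 = 1 << 30

// requiredMemory returns the minimum total memory in bytes for a model
// size and file type, or 0 when the model size is unknown.
// (Hypothetical helper mirroring the thresholds in the hunk above.)
func requiredMemory(modelType, fileType string) uint64 {
	var gb uint64
	switch modelType {
	case "3B", "7B":
		gb = 8
	case "13B":
		gb = 16
	case "30B", "34B", "40B":
		gb = 32
	case "65B", "70B":
		gb = 64
	case "180B":
		gb = 128
	}
	if fileType == "F16" {
		if modelType == "180B" {
			gb *= 4 // 512 GB for 180B F16
		} else {
			gb *= 2
		}
	}
	return gb * gigabyte
}

// checkMemory compares the requirement against the total system memory,
// both expressed in bytes.
func checkMemory(modelType, fileType string, totalBytes uint64) error {
	need := requiredMemory(modelType, fileType)
	if totalBytes < need {
		return fmt.Errorf("model requires at least %dGB of memory", need/gigabyte)
	}
	return nil
}

func main() {
	// A machine with 16 GiB of memory, checked against two configurations.
	total := 16 * gigabyte
	fmt.Println(checkMemory("13B", "Q4_0", total))
	fmt.Println(checkMemory("13B", "F16", total))
}
```

Keeping the sizes in one function means the error message and the comparison cannot drift apart, which is the main risk of repeating the literal in both places.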
--- a/llm/starcoder.go
+++ b/llm/starcoder.go
@@ -1,23 +0,0 @@
-package llm
-
-const (
-	starCoderModelType1B  = 24
-	starCoderModelType3B  = 36
-	starCoderModelType7B  = 42
-	starCoderModelType15B = 40
-)
-
-func starCoderModelType(numLayer uint32) string {
-	switch numLayer {
-	case 24:
-		return "1B"
-	case 36:
-		return "3B"
-	case 42:
-		return "7B"
-	case 40:
-		return "15B"
-	default:
-		return "unknown"
-	}
-}
--- a/parser/parser.go
+++ b/parser/parser.go
@@ -40,7 +40,7 @@ func Parse(reader io.Reader) ([]Command, error) {
 			command.Args = string(fields[1])
 			// copy command for validation
 			modelCommand = command
-		case "LICENSE", "TEMPLATE", "SYSTEM", "PROMPT", "ADAPTER":
+		case "LICENSE", "TEMPLATE", "SYSTEM", "PROMPT", "EMBED", "ADAPTER":
 			command.Name = string(bytes.ToLower(fields[0]))
 			command.Args = string(fields[1])
 		case "PARAMETER":
@@ -51,8 +51,6 @@ func Parse(reader io.Reader) ([]Command, error) {

 			command.Name = string(fields[0])
 			command.Args = string(fields[1])
-		case "EMBED":
-			return nil, fmt.Errorf("deprecated command: EMBED is no longer supported, use the /embed API endpoint instead")
 		default:
 			if !bytes.HasPrefix(fields[0], []byte("#")) {
 				// log a warning for unknown commands
--- a/runner/.gitignore
+++ b/runner/.gitignore
@@ -1,2 +0,0 @@
-model.bin
-runner
--- a/runner/darwin.go
+++ b/runner/darwin.go
@@ -1,39 +0,0 @@
-package main
-
-import (
-	"embed"
-	"io"
-	"os"
-	"path/filepath"
-)
-
-//go:embed ggml-metal.metal
-var fs embed.FS
-
-func init() {
-	exec, err := os.Executable()
-	if err != nil {
-		return
-	}
-
-	exec, err = filepath.EvalSymlinks(exec)
-	if err != nil {
-		return
-	}
-
-	dst, err := os.Create(filepath.Join(filepath.Dir(exec), "ggml-metal.metal"))
-	if err != nil {
-		return
-	}
-	defer dst.Close()
-
-	src, err := fs.Open("ggml-metal.metal")
-	if err != nil {
-		return
-	}
-	defer src.Close()
-
-	if _, err := io.Copy(dst, src); err != nil {
-		return
-	}
-}
--- a/runner/ggml-alloc.c
+++ b/runner/ggml-alloc.c
@@ -1,620 +0,0 @@
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#include "ggml-alloc.h"
-#include "ggml-backend.h"
-#include "ggml.h"
-#include <assert.h>
-#include <stdarg.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-
-
-#define UNUSED(x) (void)(x)
-#define MAX(a, b) ((a) > (b) ? (a) : (b))
-#define GGML_MAX_CONCUR (2*GGML_MAX_NODES)
-
-//#define GGML_ALLOCATOR_DEBUG
-
-//#define AT_PRINTF printf
-#define AT_PRINTF(...) ((void)0)
-
-struct hash_node {
-    struct ggml_tensor * t;
-    int n_children;
-    int n_views;
-};
-
-static size_t hash(void * p) {
-    return (size_t)p % GGML_GRAPH_HASHTABLE_SIZE;
-}
-
-static struct hash_node * hash_get(struct hash_node hash_table[], struct ggml_tensor * t) {
-    size_t h = hash(t);
-
-    // linear probing
-    size_t i = h;
-    while (hash_table[i].t != NULL) {
-        if (hash_table[i].t == t) {
-            return &hash_table[i];
-        }
-        i = (i + 1) % GGML_GRAPH_HASHTABLE_SIZE;
-        if (i == h) {
-            // hash table is full
-            GGML_ASSERT(false);
-        }
-    }
-
-    hash_table[i].t = t;
-    return &hash_table[i];
-}
-
-// TODO: GGML_PAD ?
-static size_t aligned_offset(const void * buffer, size_t offset, size_t alignment) {
-    assert(alignment && !(alignment & (alignment - 1))); // power of 2
-    size_t align = (alignment - (((uintptr_t)buffer + offset) % alignment)) % alignment;
-    return offset + align;
-}
-
-struct free_block {
-    void * addr;
-    size_t size;
-};
-
-#define MAX_FREE_BLOCKS 256
-
-struct ggml_allocr {
-    struct ggml_backend_buffer * buffer;
-    bool buffer_owned;
-    void * data;
-    size_t alignment;
-    int n_free_blocks;
-    struct free_block free_blocks[MAX_FREE_BLOCKS];
-    struct hash_node hash_table[GGML_GRAPH_HASHTABLE_SIZE];
-    size_t max_size;
-    bool measure;
-    int parse_seq[GGML_MAX_CONCUR];
-    int parse_seq_len;
-
-#ifdef GGML_ALLOCATOR_DEBUG
-    struct ggml_tensor * allocated_tensors[1024];
-#endif
-};
-
-#ifdef GGML_ALLOCATOR_DEBUG
-static void add_allocated_tensor(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
-    for (int i = 0; i < 1024; i++) {
-        if (alloc->allocated_tensors[i] == NULL) {
-            alloc->allocated_tensors[i] = tensor;
-            return;
-        }
-    }
-    GGML_ASSERT(!"out of allocated_tensors");
-}
-static void remove_allocated_tensor(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
-    for (int i = 0; i < 1024; i++) {
-        if (alloc->allocated_tensors[i] == tensor ||
-            (alloc->allocated_tensors[i] != NULL && alloc->allocated_tensors[i]->data == tensor->data)) {
-            alloc->allocated_tensors[i] = NULL;
-            return;
-        }
-    }
-    printf("tried to free tensor %s not found\n", tensor->name);
-    GGML_ASSERT(!"tensor not found");
-}
-#endif
-
-// check if a tensor is allocated by this buffer
-static bool ggml_allocr_is_own(struct ggml_allocr * alloc, const struct ggml_tensor * tensor) {
-    return tensor->buffer == alloc->buffer;
-}
-
-static bool ggml_is_view(struct ggml_tensor * t) {
-    return t->view_src != NULL;
-}
-
-void ggml_allocr_alloc(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
-    GGML_ASSERT(!ggml_is_view(tensor)); // views generally get data pointer from one of their sources
-    GGML_ASSERT(tensor->data == NULL); // avoid allocating tensor which already has memory allocated
-
-    size_t size = ggml_backend_buffer_get_alloc_size(alloc->buffer, tensor);
-    size = aligned_offset(NULL, size, alloc->alignment);
-
-    AT_PRINTF("%s: allocating %s (%zu bytes) - ", __func__, tensor->name, size);
-
-    size_t max_avail = 0;
-
-    // find the best fitting free block besides the last block
-    int best_fit_block = -1;
-    size_t best_fit_size = SIZE_MAX;
-    for (int i = 0; i < alloc->n_free_blocks - 1; i++) {
-        struct free_block * block = &alloc->free_blocks[i];
-        max_avail = MAX(max_avail, block->size);
-        if (block->size >= size && block->size <= best_fit_size) {
-            best_fit_block = i;
-            best_fit_size = block->size;
-        }
-    }
-
-    AT_PRINTF("block %d\n", best_fit_block);
-
-    if (best_fit_block == -1) {
-        // the last block is our last resort
-        struct free_block * block = &alloc->free_blocks[alloc->n_free_blocks - 1];
-        max_avail = MAX(max_avail, block->size);
-        if (block->size >= size) {
-            best_fit_block = alloc->n_free_blocks - 1;
-        } else {
-            fprintf(stderr, "%s: not enough space in the buffer (needed %zu, largest block available %zu)\n",
-                    __func__, size, max_avail);
-            GGML_ASSERT(!"not enough space in the buffer");
-            return;
-        }
-    }
-    struct free_block * block = &alloc->free_blocks[best_fit_block];
-    void * addr = block->addr;
-    block->addr = (char*)block->addr + size;
-    block->size -= size;
-    if (block->size == 0) {
-        // remove block if empty
-        alloc->n_free_blocks--;
-        for (int j = best_fit_block; j < alloc->n_free_blocks; j++) {
-            alloc->free_blocks[j] = alloc->free_blocks[j+1];
-        }
-    }
-
-    tensor->data = addr;
-    AT_PRINTF("%s: allocated data at %p\n", __func__, tensor->data);
-    tensor->buffer = alloc->buffer;
-    ggml_backend_buffer_init_tensor(alloc->buffer, tensor);
-
-#ifdef GGML_ALLOCATOR_DEBUG
-    add_allocated_tensor(alloc, tensor);
-    size_t cur_max = (char*)addr - (char*)alloc->data + size;
-    if (cur_max > alloc->max_size) {
-        printf("max_size = %.2f MB: tensors: ", cur_max / 1024.0 / 1024.0);
-        for (int i = 0; i < 1024; i++) {
-            if (alloc->allocated_tensors[i]) {
-                printf("%s (%.2f MB) ", alloc->allocated_tensors[i]->name, ggml_nbytes(alloc->allocated_tensors[i]) / 1024.0 / 1024.0);
-            }
-        }
-        printf("\n");
-    }
-#endif
-
-    alloc->max_size = MAX(alloc->max_size, (char*)addr - (char*)alloc->data + size);
-}
-
-// this is a very naive implementation, but for our case the number of free blocks should be very small
-static void ggml_allocr_free_tensor(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
-    if (ggml_allocr_is_own(alloc, tensor) == false) {
-        // the tensor was not allocated in this buffer
-        // this can happen because the graph allocator will try to free weights and other tensors from different buffers
-        // the easiest way to deal with this is just to ignore it
-        AT_PRINTF("ignoring %s (their buffer: %p, our buffer: %p)\n", tensor->name, (void *)tensor->buffer, (void *)alloc->buffer);
-        return;
-    }
-
-    void * ptr = tensor->data;
-
-    size_t size = ggml_backend_buffer_get_alloc_size(alloc->buffer, tensor);
-    size = aligned_offset(NULL, size, alloc->alignment);
-    AT_PRINTF("%s: freeing %s at %p (%zu bytes) - n_free_blocks = %d\n", __func__, tensor->name, ptr, size, alloc->n_free_blocks);
-
-    ggml_backend_buffer_free_tensor(alloc->buffer, tensor);
-
-#ifdef GGML_ALLOCATOR_DEBUG
-    remove_allocated_tensor(alloc, tensor);
-#endif
-
-    // see if we can merge with an existing block
-    for (int i = 0; i < alloc->n_free_blocks; i++) {
-        struct free_block * block = &alloc->free_blocks[i];
-        // check if ptr is at the end of the block
-        if ((char*)block->addr + block->size == ptr) {
-            block->size += size;
-            // check if we can merge with the next block
-            if (i < alloc->n_free_blocks - 1 && (char*)block->addr + block->size == alloc->free_blocks[i+1].addr) {
-                block->size += alloc->free_blocks[i+1].size;
-                alloc->n_free_blocks--;
-                for (int j = i+1; j < alloc->n_free_blocks; j++) {
-                    alloc->free_blocks[j] = alloc->free_blocks[j+1];
-                }
-            }
-            return;
-        }
-        // check if ptr is at the beginning of the block
-        if ((char*)ptr + size == block->addr) {
-            block->addr = ptr;
-            block->size += size;
-            // check if we can merge with the previous block
-            if (i > 0 && (char*)alloc->free_blocks[i-1].addr + alloc->free_blocks[i-1].size == block->addr) {
-                alloc->free_blocks[i-1].size += block->size;
-                alloc->n_free_blocks--;
-                for (int j = i; j < alloc->n_free_blocks; j++) {
-                    alloc->free_blocks[j] = alloc->free_blocks[j+1];
-                }
-            }
-            return;
-        }
-    }
-    // otherwise, add a new block
-    GGML_ASSERT(alloc->n_free_blocks < MAX_FREE_BLOCKS && "out of free blocks");
-    // insert the new block in the correct position to keep the array sorted by address (to make merging blocks faster)
-    int insert_pos = 0;
-    while (insert_pos < alloc->n_free_blocks && alloc->free_blocks[insert_pos].addr < ptr) {
-        insert_pos++;
-    }
-    // shift all blocks from insert_pos onward to make room for the new block
-    for (int i = alloc->n_free_blocks; i > insert_pos; i--) {
-        alloc->free_blocks[i] = alloc->free_blocks[i-1];
-    }
-    // insert the new block
-    alloc->free_blocks[insert_pos].addr = ptr;
-    alloc->free_blocks[insert_pos].size = size;
-    alloc->n_free_blocks++;
-}
-
-void ggml_allocr_set_parse_seq(struct ggml_allocr * alloc, const int * list, int n) {
-    for (int i = 0; i < n; i++) {
-        alloc->parse_seq[i] = list[i];
-    }
-    alloc->parse_seq_len = n;
-}
-
-void ggml_allocr_reset(struct ggml_allocr * alloc) {
-    alloc->n_free_blocks = 1;
-    size_t align_offset = aligned_offset(alloc->data, 0, alloc->alignment);
-    alloc->free_blocks[0].addr = (char *)alloc->data + align_offset;
-    alloc->free_blocks[0].size = ggml_backend_buffer_get_size(alloc->buffer) - align_offset;
-}
-
-struct ggml_allocr * ggml_allocr_new(void * data, size_t size, size_t alignment) {
-    struct ggml_backend_buffer * buffer = ggml_backend_cpu_buffer_from_ptr(NULL, data, size);
-
-    struct ggml_allocr * alloc = (struct ggml_allocr *)malloc(sizeof(struct ggml_allocr));
-
-    *alloc = (struct ggml_allocr){
-        /*.buffer        = */ buffer,
-        /*.buffer_owned  = */ true,
-        /*.base          = */ ggml_backend_buffer_get_base(buffer),
-        /*.alignment     = */ alignment,
-        /*.n_free_blocks = */ 0,
-        /*.free_blocks   = */ {{0}},
-        /*.hash_table    = */ {{0}},
-        /*.max_size      = */ 0,
-        /*.measure       = */ false,
-        /*.parse_seq     = */ {0},
-        /*.parse_seq_len = */ 0,
-#ifdef GGML_ALLOCATOR_DEBUG
-        /*.allocated_tensors = */ {0},
-#endif
-    };
-
-    ggml_allocr_reset(alloc);
-
-    return alloc;
-}
-
-struct ggml_allocr * ggml_allocr_new_measure(size_t alignment) {
-    struct ggml_allocr * alloc = ggml_allocr_new((void *)0x1000, (size_t)-0x1001, alignment);
-    alloc->measure = true;
-
-    return alloc;
-}
-
-struct ggml_allocr * ggml_allocr_new_from_buffer(struct ggml_backend_buffer * buffer) {
-    struct ggml_allocr * alloc = (struct ggml_allocr *)malloc(sizeof(struct ggml_allocr));
-
-    *alloc = (struct ggml_allocr){
-        /*.buffer        = */ buffer,
-        /*.buffer_owned  = */ false,
-        /*.base          = */ ggml_backend_buffer_get_base(buffer),
-        /*.alignment     = */ ggml_backend_buffer_get_alignment(buffer),
-        /*.n_free_blocks = */ 0,
-        /*.free_blocks   = */ {{0}},
-        /*.hash_table    = */ {{0}},
-        /*.max_size      = */ 0,
-        /*.measure       = */ false,
-        /*.parse_seq     = */ {0},
-        /*.parse_seq_len = */ 0,
-#ifdef GGML_ALLOCATOR_DEBUG
-        /*.allocated_tensors = */ {0},
-#endif
-    };
-
-    ggml_allocr_reset(alloc);
-
-    return alloc;
-}
-
-void ggml_allocr_free(struct ggml_allocr * alloc) {
-    if (alloc->buffer_owned) {
-        ggml_backend_buffer_free(alloc->buffer);
-    }
-    free(alloc);
-}
-
-bool ggml_allocr_is_measure(struct ggml_allocr * alloc) {
-    return alloc->measure;
-}
-
-//////////// compute graph allocator
-
-static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
-    if (a->type != b->type) {
-        return false;
-    }
-    for (int i = 0; i < GGML_MAX_DIMS; i++) {
-        if (a->ne[i] != b->ne[i]) {
-            return false;
-        }
-        if (a->nb[i] != b->nb[i]) {
-            return false;
-        }
-    }
-    return true;
-}
-
-static bool ggml_op_can_inplace(enum ggml_op op) {
-    switch (op) {
-        case GGML_OP_SCALE:
-        case GGML_OP_DIAG_MASK_ZERO:
-        case GGML_OP_DIAG_MASK_INF:
-        case GGML_OP_ADD:
-        case GGML_OP_ADD1:
-        case GGML_OP_SUB:
-        case GGML_OP_MUL:
-        case GGML_OP_DIV:
-        case GGML_OP_SQR:
-        case GGML_OP_SQRT:
-        case GGML_OP_LOG:
-        case GGML_OP_UNARY:
-        case GGML_OP_ROPE:
-        case GGML_OP_RMS_NORM:
-        case GGML_OP_SOFT_MAX:
-            return true;
-
-        default:
-            return false;
-    }
-}
-
-static void init_view(struct ggml_allocr * alloc, struct ggml_tensor * view) {
-    assert(view->view_src != NULL && view->view_src->data != NULL);
-    view->backend = view->view_src->backend;
-    view->buffer  = view->view_src->buffer;
-    view->data    = (char *)view->view_src->data + view->view_offs;
-
-    // FIXME: the view should be initialized by the owning buffer, but currently this breaks the CUDA backend
-    // due to the ggml_tensor_extra_gpu ring buffer overwriting the KV cache extras
-    assert(ggml_allocr_is_measure(alloc) || !view->buffer || view->buffer->backend == alloc->buffer->backend);
-    ggml_backend_buffer_init_tensor(alloc->buffer, view);
-}
-
-static void allocate_node(struct ggml_allocr * alloc, struct ggml_tensor * node) {
-    struct hash_node * ht = alloc->hash_table;
-    if (node->data == NULL) {
-        if (ggml_is_view(node)) {
-            init_view(alloc, node);
-        } else {
-            // see if we can reuse a parent's buffer (inplace)
-            if (ggml_op_can_inplace(node->op)) {
-                for (int i = 0; i < GGML_MAX_SRC; i++) {
-                    struct ggml_tensor * parent = node->src[i];
-                    if (parent == NULL) {
-                        break;
-                    }
-
-                    // if the node's data is external, then we cannot re-use it
-                    if (ggml_allocr_is_own(alloc, parent) == false) {
-                        AT_PRINTF("not reusing parent %s for %s as %p is external\n", parent->name, node->name, parent->data);
-                        continue;
-                    }
-
-                    struct hash_node * p_hn = hash_get(ht, parent);
-                    if (parent->data != NULL && p_hn->n_children == 1 && p_hn->n_views == 0 && ggml_are_same_layout(node, parent)) {
-                        if (ggml_is_view(parent)) {
-                            struct ggml_tensor * view_src = parent->view_src;
-                            struct hash_node * view_src_hn = hash_get(ht, view_src);
-                            if (view_src_hn->n_views == 1 && view_src_hn->n_children == 0 && view_src->data == parent->data) {
-                                // TODO: the offset of the view parent must be kept to ensure that the op doesn't overwrite
-                                // the parent's data that it will need later (same layout requirement). the problem is that then
-                                // we cannot free the tensor because the original address of the allocation is lost.
-                                // adding a view_src pointer to the tensor would solve this and simplify the code dealing with views
-                                // for now, we only reuse the parent's data if the offset is zero (view_src->data == parent->data)
-                                AT_PRINTF("reusing view parent %s (%s) for %s\n", parent->name, view_src->name, node->name);
-                                node->view_src = view_src;
-                                view_src_hn->n_views += 1;
-                                init_view(alloc, node);
-                                return;
-                            }
-                        }
-                        else {
-                            AT_PRINTF("reusing parent %s for %s\n", parent->name, node->name);
-                            node->view_src = parent;
-                            p_hn->n_views += 1;
-                            init_view(alloc, node);
-                            return;
-                        }
-                    }
-                }
-            }
-            ggml_allocr_alloc(alloc, node);
-        }
-    }
-}
-
-size_t ggml_allocr_alloc_graph_n(
-    struct ggml_allocr * alloc,
-    struct ggml_cgraph ** graphs, int n_graphs,
-    struct ggml_tensor *** inputs, struct ggml_tensor *** outputs) {
-
-    // reset hash table
-    struct hash_node * ht = alloc->hash_table;
-    memset(ht, 0, sizeof(struct hash_node) * GGML_GRAPH_HASHTABLE_SIZE);
-
-    // count number of children and views
-    for (int g = 0; g < n_graphs; g++) {
-        struct ggml_cgraph * gf = graphs[g];
-        for (int i = 0; i < gf->n_nodes; i++) {
-            struct ggml_tensor * node = gf->nodes[i];
-
-            if (ggml_is_view(node)) {
-                struct ggml_tensor * view_src = node->view_src;
-                hash_get(ht, view_src)->n_views += 1;
-                if (node->buffer == NULL && node->data != NULL) {
-                    // view of a pre-allocated tensor, didn't call init_view() yet
-                    init_view(alloc, node);
-                }
-            }
-
-            for (int j = 0; j < GGML_MAX_SRC; j++) {
-                struct ggml_tensor * parent = node->src[j];
-                if (parent == NULL) {
-                    break;
-                }
-                hash_get(ht, parent)->n_children += 1;
-                if (ggml_is_view(parent) && parent->buffer == NULL && parent->data != NULL) {
-                    init_view(alloc, parent);
-                }
-            }
-        }
-    }
-
-    // allocate tensors
-    for (int g = 0; g < n_graphs; g++) {
-        struct ggml_cgraph * gf = graphs[g];
-        AT_PRINTF("####### graph %d/%d\n", g, n_graphs);
-        // graph inputs are allocated first to ensure that they are not overwritten by each other
-        if (inputs != NULL && inputs[g] != NULL) {
-            for (int i = 0; inputs[g][i] != NULL; i++) {
-                struct ggml_tensor * input = inputs[g][i];
-                AT_PRINTF("input: %s\n", input->name);
-                allocate_node(alloc, input);
-            }
-        }
-        // if we have parse_seq then we allocate nodes following the list, and we only free nodes at barriers
-        int last_barrier_pos = 0;
-        int n_nodes = alloc->parse_seq_len ? alloc->parse_seq_len : gf->n_nodes;
-
-        for (int ind = 0; ind < n_nodes; ind++) {
-            // allocate a node if there is no parse_seq or this is not a barrier
-            if ((alloc->parse_seq_len==0) || alloc->parse_seq[ind] != -1) {
-                int i = alloc->parse_seq_len ? alloc->parse_seq[ind] : ind;
-                struct ggml_tensor * node = gf->nodes[i];
-
-                // allocate parents (leafs)
-                for (int j = 0; j < GGML_MAX_SRC; j++) {
-                    struct ggml_tensor * parent = node->src[j];
-                    if (parent == NULL) {
-                        break;
-                    }
-                    allocate_node(alloc, parent);
-                }
-
-                // allocate node
-                allocate_node(alloc, node);
-
-                AT_PRINTF("exec: %s (%s) <= ", ggml_op_name(node->op), node->name);
-                for (int j = 0; j < GGML_MAX_SRC; j++) {
-                    struct ggml_tensor * parent = node->src[j];
-                    if (parent == NULL) {
-                        break;
-                    }
-                    AT_PRINTF("%s", parent->name);
-                    if (j < GGML_MAX_SRC - 1 && node->src[j + 1] != NULL) {
-                        AT_PRINTF(", ");
-                    }
-                }
-                AT_PRINTF("\n");
-            }
-
-            // update parents
-            // update immediately if there is no parse_seq
-            // update only at barriers if there is parse_seq
-            if ((alloc->parse_seq_len == 0) || alloc->parse_seq[ind] == -1) {
-                int update_start = alloc->parse_seq_len ? last_barrier_pos : ind;
-                int update_end   = alloc->parse_seq_len ? ind              : ind + 1;
-                for (int i = update_start; i < update_end; i++) {
-                    int node_i = alloc->parse_seq_len ? alloc->parse_seq[i] : i;
-                    struct ggml_tensor * node = gf->nodes[node_i];
-
-                    for (int j = 0; j < GGML_MAX_SRC; j++) {
-                        struct ggml_tensor * parent = node->src[j];
-                        if (parent == NULL) {
-                            break;
-                        }
-                        struct hash_node * p_hn = hash_get(ht, parent);
-                        p_hn->n_children -= 1;
-
-                        //AT_PRINTF("parent %s: %d children, %d views\n", parent->name, parent->n_children, parent->n_views);
-
-                        if (p_hn->n_children == 0 && p_hn->n_views == 0) {
-                            if (ggml_is_view(parent)) {
-                                struct ggml_tensor * view_src = parent->view_src;
-                                struct hash_node * view_src_hn = hash_get(ht, view_src);
-                                view_src_hn->n_views -= 1;
-                                AT_PRINTF("view_src %s: %d children, %d views\n", view_src->name, view_src_hn->n_children, view_src_hn->n_views);
-                                if (view_src_hn->n_views == 0 && view_src_hn->n_children == 0 && view_src->data != node->data) {
-                                    ggml_allocr_free_tensor(alloc, view_src);
-                                }
-                            }
-                            else {
-                                if (parent->data != node->data) {
-                                    ggml_allocr_free_tensor(alloc, parent);
-                                }
-                            }
-                        }
-                    }
-                }
-                AT_PRINTF("\n");
-                if (alloc->parse_seq_len) {
-                    last_barrier_pos = ind + 1;
-                }
-            }
-        }
-        // free graph outputs here that wouldn't be freed otherwise because they have no children
-        if (outputs != NULL && outputs[g] != NULL) {
-            for (int i = 0; outputs[g][i] != NULL; i++) {
-                struct ggml_tensor * output = outputs[g][i];
-                AT_PRINTF("output: %s\n", output->name);
-                ggml_allocr_free_tensor(alloc, output);
-            }
-        }
-    }
-
-    return alloc->max_size;
-}
-
-size_t ggml_allocr_alloc_graph(struct ggml_allocr * alloc, struct ggml_cgraph * graph) {
-    return ggml_allocr_alloc_graph_n(alloc, &graph, 1, NULL, NULL);
-}
-
-size_t ggml_allocr_max_size(struct ggml_allocr * alloc) {
-    return alloc->max_size;
-}
--- a/runner/ggml-alloc.h
+++ b/runner/ggml-alloc.h
@@ -1,59 +0,0 @@
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#pragma once
-
-#include "ggml.h"
-
-#ifdef  __cplusplus
-extern "C" {
-#endif
-
-struct ggml_backend_buffer;
-
-GGML_API struct ggml_allocr * ggml_allocr_new(void * data, size_t size, size_t alignment);
-GGML_API struct ggml_allocr * ggml_allocr_new_measure(size_t alignment);
-GGML_API struct ggml_allocr * ggml_allocr_new_from_buffer(struct ggml_backend_buffer * buffer);
-
-// tell the allocator to parse nodes following the order described in the list
-// you should call this if your graph is optimized to execute out-of-order
-GGML_API void   ggml_allocr_set_parse_seq(struct ggml_allocr * alloc, const int * list, int n);
-
-GGML_API void   ggml_allocr_free       (struct ggml_allocr * alloc);
-GGML_API bool   ggml_allocr_is_measure (struct ggml_allocr * alloc);
-GGML_API void   ggml_allocr_reset      (struct ggml_allocr * alloc);
-GGML_API void   ggml_allocr_alloc      (struct ggml_allocr * alloc, struct ggml_tensor * tensor);
-GGML_API size_t ggml_allocr_alloc_graph(struct ggml_allocr * alloc, struct ggml_cgraph * graph);
-GGML_API size_t ggml_allocr_max_size   (struct ggml_allocr * alloc);
-
-GGML_API size_t ggml_allocr_alloc_graph_n(
-                    struct ggml_allocr * alloc,
-                    struct ggml_cgraph ** graphs, int n_graphs,
-                    struct ggml_tensor *** inputs, struct ggml_tensor *** outputs);
-
-#ifdef  __cplusplus
-}
-#endif
--- a/runner/ggml-backend.c
+++ b/runner/ggml-backend.c
@@ -1,411 +0,0 @@
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#include "ggml-backend.h"
-#include "ggml-alloc.h"
-
-#include <assert.h>
-#include <stdarg.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-
-#define UNUSED GGML_UNUSED
-
-#define MAX(a, b) ((a) > (b) ? (a) : (b))
-
-// backend buffer
-
-ggml_backend_buffer_t ggml_backend_buffer_init(
-        struct ggml_backend                  * backend,
-        struct ggml_backend_buffer_i           iface,
-               ggml_backend_buffer_context_t   context,
-               size_t                          size) {
-    ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));
-
-    GGML_ASSERT(iface.get_base != NULL);
-
-    (*buffer) = (struct ggml_backend_buffer) {
-        /* .interface = */ iface,
-        /* .backend   = */ backend,
-        /* .context   = */ context,
-        /* .size      = */ size,
-    };
-
-    return buffer;
-}
-
-void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
-    if (buffer->iface.free_buffer != NULL) {
-        buffer->iface.free_buffer(buffer);
-    }
-    free(buffer);
-}
-
-size_t ggml_backend_buffer_get_alignment(ggml_backend_buffer_t buffer) {
-    return ggml_backend_get_alignment(buffer->backend);
-}
-
-void * ggml_backend_buffer_get_base(ggml_backend_buffer_t buffer) {
-    return buffer->iface.get_base(buffer);
-}
-
-size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer) {
-    return buffer->size;
-}
-
-size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
-    if (buffer->iface.get_alloc_size) {
-        return buffer->iface.get_alloc_size(buffer, tensor);
-    }
-    return ggml_nbytes(tensor);
-}
-
-void ggml_backend_buffer_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
-    if (buffer->iface.init_tensor) {
-        buffer->iface.init_tensor(buffer, tensor);
-    }
-}
-
-void ggml_backend_buffer_free_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
-    if (buffer->iface.free_tensor) {
-        buffer->iface.free_tensor(buffer, tensor);
-    }
-}
-
-// backend
-
-ggml_backend_t ggml_get_backend(const struct ggml_tensor * tensor) {
-    return tensor->buffer->backend;
-}
-
-const char * ggml_backend_name(ggml_backend_t backend) {
-    return backend->iface.get_name(backend);
-}
-
-void ggml_backend_free(ggml_backend_t backend) {
-    backend->iface.free(backend);
-}
-
-ggml_backend_buffer_t ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size) {
-    return backend->iface.alloc_buffer(backend, size);
-}
-
-size_t ggml_backend_get_alignment(ggml_backend_t backend) {
-    return backend->iface.get_alignment(backend);
-}
-
-void ggml_backend_tensor_set_async(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
-    ggml_get_backend(tensor)->iface.set_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
-}
-
-void ggml_backend_tensor_get_async(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
-    ggml_get_backend(tensor)->iface.get_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
-}
-
-void ggml_backend_tensor_set(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
-    ggml_get_backend(tensor)->iface.set_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
-    ggml_get_backend(tensor)->iface.synchronize(ggml_get_backend(tensor));
-}
-
-void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
-    ggml_get_backend(tensor)->iface.get_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
-    ggml_get_backend(tensor)->iface.synchronize(ggml_get_backend(tensor));
-}
-
-void ggml_backend_synchronize(ggml_backend_t backend) {
-    backend->iface.synchronize(backend);
-}
-
-ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
-    return backend->iface.graph_plan_create(backend, cgraph);
-}
-
-void ggml_backend_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
-    backend->iface.graph_plan_free(backend, plan);
-}
-
-void ggml_backend_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
-    backend->iface.graph_plan_compute(backend, plan);
-}
-
-void ggml_backend_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
-    backend->iface.graph_compute(backend, cgraph);
-}
-
-bool ggml_backend_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
-    return backend->iface.supports_op(backend, op);
-}
-
-// backend copy
-
-static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
-    if (a->type != b->type) {
-        return false;
-    }
-    for (int i = 0; i < GGML_MAX_DIMS; i++) {
-        if (a->ne[i] != b->ne[i]) {
-            return false;
-        }
-        if (a->nb[i] != b->nb[i]) {
-            return false;
-        }
-    }
-    return true;
-}
-
-void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst) {
-    //printf("src: %s ne: [%d %d %d %d] nb: [%d %d %d %d]\n", src->name, (int)src->ne[0], (int)src->ne[1], (int)src->ne[2], (int)src->ne[3], (int)src->nb[0], (int)src->nb[1], (int)src->nb[2], (int)src->nb[3]);
-    //printf("dst: %s ne: [%d %d %d %d] nb: [%d %d %d %d]\n", dst->name, (int)dst->ne[0], (int)dst->ne[1], (int)dst->ne[2], (int)dst->ne[3], (int)dst->nb[0], (int)dst->nb[1], (int)dst->nb[2], (int)dst->nb[3]);
-    GGML_ASSERT(ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts");
-
-    // printf("cpy tensor %s from %s to %s (%lu bytes)\n", src->name, ggml_backend_name(src->backend), ggml_backend_name(dst->backend), ggml_nbytes(src));
-
-    if (src == dst) {
-        return;
-    }
-
-    // TODO: allow backends to support copy to/from same backend
-
-    if (ggml_get_backend(dst)->iface.cpy_tensor_from != NULL) {
-        ggml_get_backend(dst)->iface.cpy_tensor_from(ggml_get_backend(dst)->context, src, dst);
-    } else if (ggml_get_backend(src)->iface.cpy_tensor_to != NULL) {
-        ggml_get_backend(src)->iface.cpy_tensor_to(ggml_get_backend(src)->context, src, dst);
-    } else {
-        // shouldn't be hit when copying from/to CPU
-        #ifndef NDEBUG
-        fprintf(stderr, "ggml_backend_tensor_copy: neither cpy_tensor_from nor cpy_tensor_to are implemented for backends %s and %s, falling back to get/set\n", ggml_backend_name(src->buffer->backend), ggml_backend_name(dst->buffer->backend));
-        #endif
-        size_t nbytes = ggml_nbytes(src);
-        void * data = malloc(nbytes);
-        ggml_backend_tensor_get(src, data, 0, nbytes);
-        ggml_backend_tensor_set(dst, data, 0, nbytes);
-        free(data);
-    }
-}
-
-// backend CPU
-
-struct ggml_backend_cpu_context {
-    int n_threads;
-    void * work_data;
-    size_t work_size;
-};
-
-static const char * ggml_backend_cpu_name(ggml_backend_t backend) {
-    return "CPU";
-
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_free(ggml_backend_t backend) {
-    struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
-    free(cpu_ctx->work_data);
-    free(cpu_ctx);
-    free(backend);
-}
-
-static void * ggml_backend_cpu_buffer_get_base(ggml_backend_buffer_t buffer) {
-    return (void *)buffer->context;
-}
-
-static void ggml_backend_cpu_buffer_free_buffer(ggml_backend_buffer_t buffer) {
-    free(buffer->context);
-    UNUSED(buffer);
-}
-
-static struct ggml_backend_buffer_i cpu_backend_buffer_i = {
-    /* .free_buffer    = */ ggml_backend_cpu_buffer_free_buffer,
-    /* .get_base       = */ ggml_backend_cpu_buffer_get_base,
-    /* .get_alloc_size = */ NULL, // defaults to ggml_nbytes
-    /* .init_tensor    = */ NULL, // no initialization required
-    /* .free_tensor    = */ NULL, // no cleanup required
-};
-
-// for buffers from ptr, free is not called
-static struct ggml_backend_buffer_i cpu_backend_buffer_i_from_ptr = {
-    /* .free_buffer    = */ NULL, // ptr is not owned by the buffer, so it does not need to be freed
-    /* .get_base       = */ ggml_backend_cpu_buffer_get_base,
-    /* .get_alloc_size = */ NULL, // defaults to ggml_nbytes
-    /* .init_tensor    = */ NULL,
-    /* .free_tensor    = */ NULL,
-};
-
-static const size_t TENSOR_ALIGNMENT = 64; // should be enough for AVX 512
-
-static ggml_backend_buffer_t ggml_backend_cpu_alloc_buffer(ggml_backend_t backend, size_t size) {
-    size += TENSOR_ALIGNMENT;   // malloc may return an address that is not aligned
-    void * data = malloc(size); // TODO: maybe use GGML_ALIGNED_MALLOC?
-
-    return ggml_backend_buffer_init(backend, cpu_backend_buffer_i, data, size);
-}
-
-static size_t ggml_backend_cpu_get_alignment(ggml_backend_t backend) {
-    return TENSOR_ALIGNMENT;
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
-    GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
-    GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
-
-    memcpy((char *)tensor->data + offset, data, size);
-
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_get_tensor_async(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
-    GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor read out of bounds");
-    GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
-
-    memcpy(data, (const char *)tensor->data + offset, size);
-
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_synchronize(ggml_backend_t backend) {
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_cpy_tensor_from(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst) {
-    ggml_backend_tensor_get(src, dst->data, 0, ggml_nbytes(src));
-
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_cpy_tensor_to(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst) {
-    // for a backend such as CUDA that can queue async calls, it is ok to do this asynchronously, but it may not be the case for other backends
-    ggml_backend_tensor_set_async(dst, src->data, 0, ggml_nbytes(src));
-
-    UNUSED(backend);
-}
-
-struct ggml_backend_plan_cpu {
-    struct ggml_cplan cplan;
-    struct ggml_cgraph cgraph;
-};
-
-static ggml_backend_graph_plan_t ggml_backend_cpu_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
-    struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
-
-    struct ggml_backend_plan_cpu * cpu_plan = malloc(sizeof(struct ggml_backend_plan_cpu));
-
-    cpu_plan->cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
-    cpu_plan->cgraph = *cgraph;
-
-    if (cpu_plan->cplan.work_size > 0) {
-        cpu_plan->cplan.work_data = malloc(cpu_plan->cplan.work_size);
-    }
-
-    return cpu_plan;
-}
-
-static void ggml_backend_cpu_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
-    struct ggml_backend_plan_cpu * cpu_plan = (struct ggml_backend_plan_cpu *)plan;
-
-    free(cpu_plan->cplan.work_data);
-    free(cpu_plan);
-
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
-    struct ggml_backend_plan_cpu * cpu_plan = (struct ggml_backend_plan_cpu *)plan;
-
-    ggml_graph_compute(&cpu_plan->cgraph, &cpu_plan->cplan);
-
-    UNUSED(backend);
-}
-
-static void ggml_backend_cpu_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
-    struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
-
-    struct ggml_cplan cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
-
-    if (cpu_ctx->work_size < cplan.work_size) {
-        // TODO: may be faster to free and use malloc to avoid the copy
-        cpu_ctx->work_data = realloc(cpu_ctx->work_data, cplan.work_size);
-        cpu_ctx->work_size = cplan.work_size;
-    }
-
-    cplan.work_data = cpu_ctx->work_data;
-
-    ggml_graph_compute(cgraph, &cplan);
-}
-
-static bool ggml_backend_cpu_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
-    return true;
-    UNUSED(backend);
-    UNUSED(op);
-}
-
-static struct ggml_backend_i cpu_backend_i = {
-    /* .get_name            = */ ggml_backend_cpu_name,
-    /* .free                = */ ggml_backend_cpu_free,
-    /* .alloc_buffer        = */ ggml_backend_cpu_alloc_buffer,
-    /* .get_alignment       = */ ggml_backend_cpu_get_alignment,
-    /* .set_tensor_async    = */ ggml_backend_cpu_set_tensor_async,
-    /* .get_tensor_async    = */ ggml_backend_cpu_get_tensor_async,
-    /* .synchronize         = */ ggml_backend_cpu_synchronize,
-    /* .cpy_tensor_from     = */ ggml_backend_cpu_cpy_tensor_from,
-    /* .cpy_tensor_to       = */ ggml_backend_cpu_cpy_tensor_to,
-    /* .graph_plan_create   = */ ggml_backend_cpu_graph_plan_create,
-    /* .graph_plan_free     = */ ggml_backend_cpu_graph_plan_free,
-    /* .graph_plan_compute  = */ ggml_backend_cpu_graph_plan_compute,
-    /* .graph_compute       = */ ggml_backend_cpu_graph_compute,
-    /* .supports_op         = */ ggml_backend_cpu_supports_op,
-};
-
-ggml_backend_t ggml_backend_cpu_init(void) {
-    struct ggml_backend_cpu_context * ctx = malloc(sizeof(struct ggml_backend_cpu_context));
-
-    ctx->n_threads = GGML_DEFAULT_N_THREADS;
-    ctx->work_data = NULL;
-    ctx->work_size = 0;
-
-    ggml_backend_t cpu_backend = malloc(sizeof(struct ggml_backend));
-
-    *cpu_backend = (struct ggml_backend) {
-        /* .interface = */ cpu_backend_i,
-        /* .context   = */ ctx
-    };
-    return cpu_backend;
-}
-
-bool ggml_backend_is_cpu(ggml_backend_t backend) {
-    return backend->iface.get_name == ggml_backend_cpu_name;
-}
-
-void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads) {
-    GGML_ASSERT(ggml_backend_is_cpu(backend_cpu));
-
-    struct ggml_backend_cpu_context * ctx = (struct ggml_backend_cpu_context *)backend_cpu->context;
-    ctx->n_threads = n_threads;
-}
-
-ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(ggml_backend_t backend_cpu, void * ptr, size_t size) {
-    return ggml_backend_buffer_init(backend_cpu, cpu_backend_buffer_i_from_ptr, ptr, size);
-}
--- a/runner/ggml-backend.h
+++ b/runner/ggml-backend.h
@@ -1,169 +0,0 @@
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#pragma once
-
-#include "ggml.h"
-
-#ifdef  __cplusplus
-extern "C" {
-#endif
-    struct ggml_backend;
-    struct ggml_backend_buffer;
-
-    // type-erased backend-specific types / wrappers
-    typedef void * ggml_backend_context_t;
-    typedef void * ggml_backend_graph_plan_t;
-    typedef void * ggml_backend_buffer_context_t;
-
-    // avoid accessing internals of these types
-    typedef struct ggml_backend        * ggml_backend_t;
-    typedef struct ggml_backend_buffer * ggml_backend_buffer_t;
-
-    //
-    // backend buffer
-    //
-
-    struct ggml_backend_buffer_i {
-        void   (*free_buffer)   (ggml_backend_buffer_t buffer);
-        void * (*get_base)      (ggml_backend_buffer_t buffer); // get base pointer
-        size_t (*get_alloc_size)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // pre-allocation callback
-        void   (*init_tensor)   (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // post-allocation callback
-        void   (*free_tensor)   (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // pre-free callback
-    };
-
-    // TODO: hide behind API
-    struct ggml_backend_buffer {
-        struct ggml_backend_buffer_i iface;
-
-        ggml_backend_t                backend;
-        ggml_backend_buffer_context_t context;
-
-        size_t size;
-    };
-
-    // backend buffer functions
-    GGML_API ggml_backend_buffer_t ggml_backend_buffer_init(
-            struct ggml_backend                  * backend,
-            struct ggml_backend_buffer_i           iface,
-                   ggml_backend_buffer_context_t   context,
-                   size_t                          size);
-
-    GGML_API void   ggml_backend_buffer_free          (ggml_backend_buffer_t buffer);
-    GGML_API size_t ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer);
-    GGML_API void * ggml_backend_buffer_get_base      (ggml_backend_buffer_t buffer);
-    GGML_API size_t ggml_backend_buffer_get_size      (ggml_backend_buffer_t buffer);
-    GGML_API size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
-    GGML_API void   ggml_backend_buffer_init_tensor   (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
-    GGML_API void   ggml_backend_buffer_free_tensor   (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
-
-    //
-    // backend
-    //
-
-    struct ggml_backend_i {
-        const char * (*get_name)(ggml_backend_t backend);
-
-        void (*free)(ggml_backend_t backend);
-
-        // buffer allocation
-        ggml_backend_buffer_t (*alloc_buffer)(ggml_backend_t backend, size_t size);
-
-        // get buffer alignment
-        size_t (*get_alignment)(ggml_backend_t backend);
-
-        // tensor data access
-        // these functions can be asynchronous, helper functions are provided for synchronous access that automatically call synchronize
-        void (*set_tensor_async)(ggml_backend_t backend,       struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
-        void (*get_tensor_async)(ggml_backend_t backend, const struct ggml_tensor * tensor,       void * data, size_t offset, size_t size);
-        void (*synchronize)     (ggml_backend_t backend);
-
-        // (optional) copy tensor between different backends, allow for single-copy transfers
-        void (*cpy_tensor_from)(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst);
-        void (*cpy_tensor_to)  (ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst);
-
-        // compute graph with a plan
-        ggml_backend_graph_plan_t (*graph_plan_create) (ggml_backend_t backend, struct ggml_cgraph * cgraph);
-        void                      (*graph_plan_free)   (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
-        void                      (*graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
-
-        // compute graph without a plan
-        void (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
-
-        // check if the backend supports an operation
-        bool (*supports_op)(ggml_backend_t backend, const struct ggml_tensor * op);
-    };
-
-    // TODO: hide behind API
-    struct ggml_backend {
-        struct ggml_backend_i iface;
-
-        ggml_backend_context_t context;
-    };
-
-    // backend helper functions
-    GGML_API ggml_backend_t ggml_get_backend(const struct ggml_tensor * tensor);
-
-    GGML_API const char * ggml_backend_name(ggml_backend_t backend);
-    GGML_API void         ggml_backend_free(ggml_backend_t backend);
-
-    GGML_API ggml_backend_buffer_t ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size);
-
-    GGML_API size_t ggml_backend_get_alignment(ggml_backend_t backend);
-
-    GGML_API void ggml_backend_tensor_set_async(      struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
-    GGML_API void ggml_backend_tensor_get_async(const struct ggml_tensor * tensor,       void * data, size_t offset, size_t size);
-
-    GGML_API void ggml_backend_tensor_set(      struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
-    GGML_API void ggml_backend_tensor_get(const struct ggml_tensor * tensor,       void * data, size_t offset, size_t size);
-
-    GGML_API void ggml_backend_synchronize(ggml_backend_t backend);
-
-    GGML_API ggml_backend_graph_plan_t ggml_backend_graph_plan_create (ggml_backend_t backend, struct ggml_cgraph * cgraph);
-
-    GGML_API void ggml_backend_graph_plan_free   (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
-    GGML_API void ggml_backend_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
-    GGML_API void ggml_backend_graph_compute     (ggml_backend_t backend, struct ggml_cgraph * cgraph);
-    GGML_API bool ggml_backend_supports_op       (ggml_backend_t backend, const struct ggml_tensor * op);
-
-    // tensor copy between different backends
-    GGML_API void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst);
-
-    //
-    // CPU backend
-    //
-
-    GGML_API ggml_backend_t ggml_backend_cpu_init(void);
-
-    GGML_API bool ggml_backend_is_cpu(ggml_backend_t backend);
-
-    GGML_API void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
-
-    GGML_API ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(ggml_backend_t backend_cpu, void * ptr, size_t size);
-
-#ifdef  __cplusplus
-}
-#endif
--- a/runner/ggml-cuda.cu
+++ b/runner/ggml-cuda.cu
--- a/runner/ggml-cuda.h
+++ b/runner/ggml-cuda.h
@@ -1,77 +0,0 @@
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#pragma once
-
-#include "ggml.h"
-#include "ggml-backend.h"
-
-#ifdef GGML_USE_HIPBLAS
-#define GGML_CUDA_NAME "ROCm"
-#define GGML_CUBLAS_NAME "hipBLAS"
-#else
-#define GGML_CUDA_NAME "CUDA"
-#define GGML_CUBLAS_NAME "cuBLAS"
-#endif
-
-#ifdef  __cplusplus
-extern "C" {
-#endif
-
-#define GGML_CUDA_MAX_DEVICES       16
-
-GGML_API void   ggml_init_cublas(void);
-GGML_API void * ggml_cuda_host_malloc(size_t size);
-GGML_API void   ggml_cuda_host_free(void * ptr);
-
-GGML_API bool   ggml_cuda_can_mul_mat(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst);
-GGML_API void   ggml_cuda_set_tensor_split(const float * tensor_split);
-GGML_API void   ggml_cuda_transform_tensor(void * data, struct ggml_tensor * tensor);
-GGML_API void   ggml_cuda_free_data(struct ggml_tensor * tensor);
-
-GGML_API void   ggml_cuda_assign_buffers(struct ggml_tensor * tensor);
-GGML_API void   ggml_cuda_assign_buffers_no_scratch(struct ggml_tensor * tensor);
-GGML_API void   ggml_cuda_assign_buffers_force_inplace(struct ggml_tensor * tensor);
-
-GGML_API void   ggml_cuda_assign_buffers_no_alloc(struct ggml_tensor * tensor);
-GGML_API void   ggml_cuda_assign_scratch_offset(struct ggml_tensor * tensor, size_t offset);
-GGML_API void   ggml_cuda_copy_to_device(struct ggml_tensor * tensor);
-
-GGML_API void   ggml_cuda_set_main_device(int main_device);
-GGML_API void   ggml_cuda_set_mul_mat_q(bool mul_mat_q);
-GGML_API void   ggml_cuda_set_scratch_size(size_t scratch_size);
-GGML_API void   ggml_cuda_free_scratch(void);
-GGML_API bool   ggml_cuda_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor);
-
-GGML_API int    ggml_cuda_get_device_count(void);
-GGML_API void   ggml_cuda_get_device_description(int device, char * description, size_t description_size);
-
-// backend API
-GGML_API ggml_backend_t ggml_backend_cuda_init(void); // TODO: take a list of devices to use
-
-#ifdef  __cplusplus
-}
-#endif
--- a/runner/ggml-metal.h
+++ b/runner/ggml-metal.h
@@ -1,134 +0,0 @@
-//go:build darwin
-
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-// An interface allowing to compute ggml_cgraph with Metal
-//
-// This is a fully functional interface that extends ggml with GPU support for Apple devices.
-// A similar interface can be created for other GPU backends (e.g. Vulkan, CUDA, OpenCL, etc.)
-//
-// How it works?
-//
-// As long as your program can create and evaluate a ggml_cgraph on the CPU, you can use this
-// interface to evaluate the same graph on the GPU. Instead of using ggml_graph_compute(), you
-// use ggml_metal_graph_compute() (or ggml_vulkan_graph_compute(), etc.)
-//
-// You only need to make sure that all memory buffers that you used during the graph creation
-// are mapped to the device memory with the ggml_metal_add_buffer() function. This mapping is
-// used during the graph evaluation to determine the arguments of the compute kernels.
-//
-// Synchronization between device and host memory (for example for input and output tensors)
-// is done with the ggml_metal_set_tensor() and ggml_metal_get_tensor() functions.
-//
-
-#pragma once
-
-#include "ggml.h"
-#include "ggml-backend.h"
-
-#include <stddef.h>
-#include <stdbool.h>
-
-// max memory buffers that can be mapped to the device
-#define GGML_METAL_MAX_BUFFERS 16
-#define GGML_METAL_MAX_COMMAND_BUFFERS 32
-
-struct ggml_tensor;
-struct ggml_cgraph;
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-//
-// internal API
-// temporary exposed to user-code
-//
-
-struct ggml_metal_context;
-
-void ggml_metal_log_set_callback(ggml_log_callback log_callback, void * user_data);
-
-// number of command buffers to use
-struct ggml_metal_context * ggml_metal_init(int n_cb);
-void ggml_metal_free(struct ggml_metal_context * ctx);
-
-void * ggml_metal_host_malloc(size_t n);
-void   ggml_metal_host_free  (void * data);
-
-// set the number of command buffers to use
-void ggml_metal_set_n_cb(struct ggml_metal_context * ctx, int n_cb);
-
-// creates a mapping between a host memory buffer and a device memory buffer
-// - make sure to map all buffers used in the graph before calling ggml_metal_graph_compute
-// - the mapping is used during computation to determine the arguments of the compute kernels
-// - you don't need to keep the host memory buffer allocated as it is never accessed by Metal
-// - max_size specifies the maximum size of a tensor and is used to create shared views such
-//   that it is guaranteed that the tensor will fit in at least one of the views
-//
-bool ggml_metal_add_buffer(
-        struct ggml_metal_context * ctx,
-                       const char * name,
-                             void * data,
-                           size_t   size,
-                           size_t   max_size);
-
-// set data from host memory into the device
-void ggml_metal_set_tensor(struct ggml_metal_context * ctx, struct ggml_tensor * t);
-
-// get data from the device into host memory
-void ggml_metal_get_tensor(struct ggml_metal_context * ctx, struct ggml_tensor * t);
-
-// try to find operations that can be run concurrently in the graph
-// you should run it again if the topology of your graph changes
-void ggml_metal_graph_find_concurrency(struct ggml_metal_context * ctx, struct ggml_cgraph * gf, bool check_mem);
-
-// if the graph has been optimized for concurrently dispatch, return length of the concur_list if optimized
-int ggml_metal_if_optimized(struct ggml_metal_context * ctx);
-
-// output the concur_list for ggml_alloc
-int * ggml_metal_get_concur_list(struct ggml_metal_context * ctx);
-
-// same as ggml_graph_compute but uses Metal
-// creates gf->n_threads command buffers in parallel
-void ggml_metal_graph_compute(struct ggml_metal_context * ctx, struct ggml_cgraph * gf);
-
-//
-// backend API
-// user-code should use only these functions
-//
-
-GGML_API ggml_backend_t ggml_backend_metal_init(void);
-
-GGML_API bool ggml_backend_is_metal(ggml_backend_t backend);
-
-GGML_API void ggml_backend_metal_set_n_cb(ggml_backend_t backend, int n_cb);
-
-#ifdef __cplusplus
-}
-#endif
-
--- a/runner/ggml-metal.m
+++ b/runner/ggml-metal.m
--- a/runner/ggml-mpi.c
+++ b/runner/ggml-mpi.c
@@ -1,244 +0,0 @@
-//go:build mpi
-
-/**
- * llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
- *
- * MIT License
- *
- * Copyright (c) 2023 Georgi Gerganov
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#include "ggml-mpi.h"
-
-#include "ggml.h"
-
-#include <mpi.h>
-
-#include <stdio.h>
-#include <stdlib.h>
-
-#define MIN(a, b) ((a) < (b) ? (a) : (b))
-
-#define UNUSED GGML_UNUSED
-
-struct ggml_mpi_context {
-    int rank;
-    int size;
-};
-
-void ggml_mpi_backend_init(void) {
-    MPI_Init(NULL, NULL);
-}
-
-void ggml_mpi_backend_free(void) {
-    MPI_Finalize();
-}
-
-struct ggml_mpi_context * ggml_mpi_init(void) {
-    struct ggml_mpi_context * ctx = calloc(1, sizeof(struct ggml_mpi_context));
-
-    MPI_Comm_rank(MPI_COMM_WORLD, &ctx->rank);
-    MPI_Comm_size(MPI_COMM_WORLD, &ctx->size);
-
-    return ctx;
-}
-
-void ggml_mpi_free(struct ggml_mpi_context * ctx) {
-    free(ctx);
-}
-
-int ggml_mpi_rank(struct ggml_mpi_context * ctx) {
-    return ctx->rank;
-}
-
-void ggml_mpi_eval_init(
-        struct ggml_mpi_context * ctx_mpi,
-                            int * n_tokens,
-                            int * n_past,
-                            int * n_threads) {
-    UNUSED(ctx_mpi);
-
-    // synchronize the worker node parameters with the root node
-    MPI_Barrier(MPI_COMM_WORLD);
-
-    MPI_Bcast(n_tokens,  1, MPI_INT, 0, MPI_COMM_WORLD);
-    MPI_Bcast(n_past,    1, MPI_INT, 0, MPI_COMM_WORLD);
-    MPI_Bcast(n_threads, 1, MPI_INT, 0, MPI_COMM_WORLD);
-}
-
-static int ggml_graph_get_node_idx(struct ggml_cgraph * gf, const char * name) {
-    struct ggml_tensor * t = ggml_graph_get_tensor(gf, name);
-    if (t == NULL) {
-        fprintf(stderr, "%s: tensor %s not found\n", __func__, name);
-        return -1;
-    }
-
-    for (int i = 0; i < gf->n_nodes; i++) {
-        if (gf->nodes[i] == t) {
-            return i;
-        }
-    }
-
-    fprintf(stderr, "%s: tensor %s not found in graph (should not happen)\n", __func__, name);
-    return -1;
-}
-
-static void ggml_mpi_tensor_send(struct ggml_tensor * t, int mpi_rank_dst) {
-    MPI_Datatype mpi_type;
-
-    switch (t->type) {
-        case GGML_TYPE_I32: mpi_type = MPI_INT32_T; break;
-        case GGML_TYPE_F32: mpi_type = MPI_FLOAT;   break;
-        default: GGML_ASSERT(false && "not implemented");
-    }
-
-    const int retval = MPI_Send(t->data, ggml_nelements(t), mpi_type, mpi_rank_dst, 0, MPI_COMM_WORLD);
-    GGML_ASSERT(retval == MPI_SUCCESS);
-}
-
-static void ggml_mpi_tensor_recv(struct ggml_tensor * t, int mpi_rank_src) {
-    MPI_Datatype mpi_type;
-
-    switch (t->type) {
-        case GGML_TYPE_I32: mpi_type = MPI_INT32_T; break;
-        case GGML_TYPE_F32: mpi_type = MPI_FLOAT;   break;
-        default: GGML_ASSERT(false && "not implemented");
-    }
-
-    MPI_Status status; UNUSED(status);
-
-    const int retval = MPI_Recv(t->data, ggml_nelements(t), mpi_type, mpi_rank_src, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
-    GGML_ASSERT(retval == MPI_SUCCESS);
-}
-
-// TODO: there are many improvements that can be done to this implementation
-void ggml_mpi_graph_compute_pre(
-        struct ggml_mpi_context * ctx_mpi,
-             struct ggml_cgraph * gf,
-                            int   n_layers) {
-    const int mpi_rank = ctx_mpi->rank;
-    const int mpi_size = ctx_mpi->size;
-
-    struct ggml_tensor * inp_tokens = ggml_graph_get_tensor(gf, "inp_tokens");
-    if (inp_tokens == NULL) {
-        fprintf(stderr, "%s: tensor 'inp_tokens' not found\n", __func__);
-        return;
-    }
-
-    struct ggml_tensor * inp0 = ggml_graph_get_tensor(gf, "layer_inp_0");
-    if (inp0 == NULL) {
-        fprintf(stderr, "%s: tensor 'inp0' not found\n", __func__);
-        return;
-    }
-
-    GGML_ASSERT(inp0 == gf->nodes[0]);
-
-    // distribute the compute graph into slices across the MPI nodes
-    //
-    // the main node (0) processes the last layers + the remainder of the compute graph
-    // and is responsible to pass the input tokens to the first node (1)
-    //
-    // node 1:   [(  0) * n_per_node, (  1) * n_per_node)
-    // node 2:   [(  1) * n_per_node, (  2) * n_per_node)
-    // ...
-    // node n-1: [(n-2) * n_per_node, (n-1) * n_per_node)
-    // node 0:   [(n-1) * n_per_node,            n_nodes)
-    //
-    if (mpi_rank > 0) {
-        if (mpi_rank == 1) {
-            // the first node (1) receives the input tokens from the main node (0)
-            ggml_mpi_tensor_recv(inp_tokens, 0);
-        } else {
-            // recv input data for each node into the "inp0" tensor (i.e. the first node in the compute graph)
-            ggml_mpi_tensor_recv(inp0, mpi_rank - 1);
-        }
-    } else if (mpi_size > 1) {
-        // node 0 sends the input tokens to node 1
-        ggml_mpi_tensor_send(inp_tokens, 1);
-
-        // recv the output data from the last node
-        ggml_mpi_tensor_recv(inp0, mpi_size - 1);
-    }
-
-    {
-        const int n_per_node = (n_layers + (mpi_size - 1)) / mpi_size;
-
-        const int mpi_idx = mpi_rank > 0 ? mpi_rank - 1 : mpi_size - 1;
-
-        const int il0 =               (mpi_idx + 0) * n_per_node;
-        const int il1 = MIN(n_layers, (mpi_idx + 1) * n_per_node);
-
-        char name_l0[GGML_MAX_NAME];
-        char name_l1[GGML_MAX_NAME];
-
-        snprintf(name_l0, sizeof(name_l0), "layer_inp_%d", il0);
-        snprintf(name_l1, sizeof(name_l1), "layer_inp_%d", il1);
-
-        const int idx_l0 =                ggml_graph_get_node_idx(gf, name_l0);
-        const int idx_l1 = mpi_rank > 0 ? ggml_graph_get_node_idx(gf, name_l1) + 1 : gf->n_nodes;
-
-        if (idx_l0 < 0 || idx_l1 < 0) {
-            fprintf(stderr, "%s: layer input nodes not found\n", __func__);
-            return;
-        }
-
-        // attach the input data to all nodes that need it
-        // TODO: not great - should be able to do this without modifying the compute graph (see next TODO below)
-        for (int i = idx_l0; i < idx_l1; i++) {
-            if (gf->nodes[i]->src[0] == gf->nodes[idx_l0]) {
-                gf->nodes[i]->src[0] =  inp0;
-            }
-            if (gf->nodes[i]->src[1] == gf->nodes[idx_l0]) {
-                gf->nodes[i]->src[1] =  inp0;
-            }
-        }
-
-        // TODO: instead of rearranging the nodes, we should be able to execute a subset of the compute graph
-        for (int i = 1; i < idx_l1 - idx_l0; i++) {
-            gf->nodes[i] = gf->nodes[idx_l0 + i];
-            gf->grads[i] = gf->grads[idx_l0 + i];
-        }
-
-        // the first node performs the "get_rows" operation, the rest of the nodes get the data from the previous node
-        if (mpi_idx != 0) {
-            gf->nodes[0]->op = GGML_OP_NONE;
-        }
-
-        gf->n_nodes = idx_l1 - idx_l0;
-
-        //fprintf(stderr, "%s: node %d: processing %d nodes [%d, %d)\n", __func__, mpi_rank, gf->n_nodes, il0, il1);
-    }
-}
-
-void ggml_mpi_graph_compute_post(
-        struct ggml_mpi_context * ctx_mpi,
-             struct ggml_cgraph * gf,
-                            int   n_layers) {
-    UNUSED(n_layers);
-
-    const int mpi_rank = ctx_mpi->rank;
-    const int mpi_size = ctx_mpi->size;
-
-    // send the output data to the next node
-    if (mpi_rank > 0) {
-        ggml_mpi_tensor_send(gf->nodes[gf->n_nodes - 1], (mpi_rank + 1) % mpi_size);
-    }
-}