Compare commits


6 Commits

Author | SHA1 | Message | Date
Josh Yan | a6d30ecefe | working causal attention | 2024-08-27 11:34:32 -07:00
Josh Yan | 80eef7c7b1 | changes | 2024-08-27 10:47:13 -07:00
Josh Yan | a33e56cddb | uses input prompt | 2024-08-23 16:29:59 -07:00
Josh Yan | e6802df906 | fixed patches, llava | 2024-08-23 14:12:26 -07:00
Josh Yan | c631633bce | paligemma demo works | 2024-08-23 13:18:26 -07:00
Roy Han | 7de230f005 | paligemma patch | 2024-08-23 13:10:43 -07:00
10 changed files with 237 additions and 179 deletions

View File

@@ -111,10 +111,7 @@ On Windows, Ollama inherits your user and system environment variables.
## How do I use Ollama behind a proxy?
Ollama pulls models from the Internet and may require a proxy server to access the models. Use `HTTPS_PROXY` to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
> [!NOTE]
> Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server.
Ollama is compatible with proxy servers if `HTTP_PROXY` or `HTTPS_PROXY` is configured. When using either variable, ensure it is set where `ollama serve` can access the value. When using `HTTPS_PROXY`, ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
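As a concrete illustration (a sketch, not the server's actual download code), the Go program below shows why the variable must be visible to the serving process: Go HTTP clients that resolve their proxy with `http.ProxyFromEnvironment`, as the default transport does, read `HTTPS_PROXY` from the environment of the process making the request. The proxy address and registry URL are placeholders.
```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Placeholder proxy address; in practice HTTPS_PROXY is exported in the
	// shell, service unit, or launchd plist that starts `ollama serve`.
	os.Setenv("HTTPS_PROXY", "https://proxy.example.com:3128")

	// The default http.Transport consults http.ProxyFromEnvironment for each
	// request, so the variable must be set in this process's environment.
	req, err := http.NewRequest("GET", "https://registry.ollama.ai/v2/", nil)
	if err != nil {
		panic(err)
	}
	proxyURL, err := http.ProxyFromEnvironment(req)
	if err != nil {
		panic(err)
	}
	fmt.Println("outbound HTTPS requests would use proxy:", proxyURL)
}
```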
### How do I use Ollama behind a proxy in Docker?
@@ -279,4 +276,4 @@ Note: Windows with Radeon GPUs currently default to 1 model maximum due to limit
## How does Ollama load models on multiple GPUs?
Installing multiple GPUs of the same brand can be a great way to increase your available VRAM to load larger models. When you load a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.
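To make the placement rule concrete, here is a small Go sketch (illustrative only, not Ollama's scheduler; the types and sizes are invented for the example): the model goes onto a single GPU whenever one has enough free VRAM, and is otherwise split across all of them.
```go
package main

import "fmt"

// gpu is a simplified stand-in for the scheduler's per-GPU information.
type gpu struct {
	ID       string
	FreeVRAM uint64 // bytes of VRAM currently available
}

// placeModel returns the GPUs a model would be loaded onto: a single GPU if
// the whole model fits there (avoiding PCI-bus traffic during inference),
// otherwise every available GPU.
func placeModel(requiredVRAM uint64, gpus []gpu) []string {
	for _, g := range gpus {
		if requiredVRAM <= g.FreeVRAM {
			return []string{g.ID}
		}
	}
	ids := make([]string, 0, len(gpus))
	for _, g := range gpus {
		ids = append(ids, g.ID)
	}
	return ids
}

func main() {
	gpus := []gpu{{ID: "GPU-0", FreeVRAM: 12 << 30}, {ID: "GPU-1", FreeVRAM: 12 << 30}}
	fmt.Println(placeModel(8<<30, gpus))  // [GPU-0]: fits entirely on one card
	fmt.Println(placeModel(20<<30, gpus)) // [GPU-0 GPU-1]: spread across both
}
```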

View File

Binary image file not shown (before: 141 KiB).

View File

Binary image file not shown (before: 80 KiB).

View File

@@ -1,129 +1,44 @@
# Importing a model
# Import
## Table of Contents
GGUF models and select Safetensors models can be imported directly into Ollama.
* [Importing a Safetensors adapter](#Importing-a-fine-tuned-adapter-from-Safetensors-weights)
* [Importing a Safetensors model](#Importing-a-model-from-Safetensors-weights)
* [Importing a GGUF file](#Importing-a-GGUF-based-model-or-adapter)
* [Sharing models on ollama.com](#Sharing-your-model-on-ollamacom)
## Import GGUF
## Importing a fine tuned adapter from Safetensors weights
First, create a `Modelfile` with a `FROM` command pointing at the base model you used for fine tuning, and an `ADAPTER` command which points to the directory with your Safetensors adapter:
```dockerfile
FROM <base model name>
ADAPTER /path/to/safetensors/adapter/directory
```
Make sure that you use the same base model in the `FROM` command as you used to create the adapter; otherwise, you will get erratic results. Most frameworks use different quantization methods, so it's best to use non-quantized (i.e. non-QLoRA) adapters. If your adapter is in the same directory as your `Modelfile`, use `ADAPTER .` to specify the adapter path.
Now run `ollama create` from the directory where the `Modelfile` was created:
```bash
ollama create my-model
```
Lastly, test the model:
```bash
ollama run my-model
```
Ollama supports importing adapters based on several different model architectures including:
* Llama (including Llama 2, Llama 3, and Llama 3.1);
* Mistral (including Mistral 1, Mistral 2, and Mixtral); and
* Gemma (including Gemma 1 and Gemma 2)
You can create the adapter using a fine tuning framework or tool which can output adapters in the Safetensors format, such as:
* Hugging Face [fine tuning framework](https://huggingface.co/docs/transformers/en/training)
* [Unsloth](https://github.com/unslothai/unsloth)
* [MLX](https://github.com/ml-explore/mlx)
## Importing a model from Safetensors weights
First, create a `Modelfile` with a `FROM` command which points to the directory containing your Safetensors weights:
```dockerfile
FROM /path/to/safetensors/directory
```
If you create the Modelfile in the same directory as the weights, you can use the command `FROM .`.
Now run the `ollama create` command from the directory where you created the `Modelfile`:
```shell
ollama create my-model
```
Lastly, test the model:
```shell
ollama run my-model
```
Ollama supports importing models for several different architectures including:
* Llama (including Llama 2, Llama 3, and Llama 3.1);
* Mistral (including Mistral 1, Mistral 2, and Mixtral);
* Gemma (including Gemma 1 and Gemma 2); and
* Phi3
This includes importing foundation models as well as any fine tuned models which have been _fused_ with a foundation model.
## Importing a GGUF based model or adapter
If you have a GGUF based model or adapter it is possible to import it into Ollama. You can obtain a GGUF model or adapter by:
* converting a Safetensors model with the `convert_hf_to_gguf.py` from Llama.cpp;
* converting a Safetensors adapter with the `convert_lora_to_gguf.py` from Llama.cpp; or
* downloading a model or adapter from a place such as HuggingFace
To import a GGUF model, create a `Modelfile` containing:
A binary GGUF file can be imported directly into Ollama through a Modelfile.
```dockerfile
FROM /path/to/file.gguf
```
For a GGUF adapter, create the `Modelfile` with:
## Import Safetensors
If the model being imported is one of these architectures, it can be imported directly into Ollama through a Modelfile:
- LlamaForCausalLM
- MistralForCausalLM
- MixtralForCausalLM
- GemmaForCausalLM
- Phi3ForCausalLM
```dockerfile
FROM <model name>
ADAPTER /path/to/file.gguf
FROM /path/to/safetensors/directory
```
When importing a GGUF adapter, it's important to use the same base model that the adapter was created with. You can use:
For architectures not directly convertible by Ollama, see llama.cpp's [guide](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize) on conversion. After conversion, see [Import GGUF](#import-gguf).
* a model from Ollama
* a GGUF file
* a Safetensors based model
## Automatic Quantization
Once you have created your `Modelfile`, use the `ollama create` command to build the model.
> [!NOTE]
> Automatic quantization requires v0.1.35 or higher.
```shell
ollama create my-model
```
## Quantizing a Model
Quantizing a model allows you to run models faster and with less memory consumption but at reduced accuracy. This allows you to run a model on more modest hardware.
Ollama can quantize FP16 and FP32 based models into different quantization levels using the `-q/--quantize` flag with the `ollama create` command.
First, create a Modelfile with the FP16 or FP32 based model you wish to quantize.
Ollama is capable of quantizing FP16 or FP32 models to any of the supported quantizations with the `-q/--quantize` flag in `ollama create`.
```dockerfile
FROM /path/to/my/gemma/f16/model
```
Use `ollama create` to then create the quantized model.
```shell
$ ollama create --quantize q4_K_M mymodel
$ ollama create -q Q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
@@ -134,53 +49,42 @@ success
### Supported Quantizations
- `q4_0`
- `q4_1`
- `q5_0`
- `q5_1`
- `q8_0`
- `Q4_0`
- `Q4_1`
- `Q5_0`
- `Q5_1`
- `Q8_0`
#### K-means Quantizations
- `q3_K_S`
- `q3_K_M`
- `q3_K_L`
- `q4_K_S`
- `q4_K_M`
- `q5_K_S`
- `q5_K_M`
- `q6_K`
- `Q3_K_S`
- `Q3_K_M`
- `Q3_K_L`
- `Q4_K_S`
- `Q4_K_M`
- `Q5_K_S`
- `Q5_K_M`
- `Q6_K`
## Template Detection
## Sharing your model on ollama.com
> [!NOTE]
> Template detection requires v0.1.42 or higher.
You can share any model you have created by pushing it to [ollama.com](https://ollama.com) so that other users can try it out.
Ollama uses model metadata, specifically `tokenizer.chat_template`, to automatically create a template appropriate for the model you're importing.
First, use your browser to go to the [Ollama Sign-Up](https://ollama.com/signup) page. If you already have an account, you can skip this step.
![Sign-Up](images/signup.png)
The `Username` field will be used as part of your model's name (e.g. `jmorganca/mymodel`), so make sure you are comfortable with the username that you have selected.
Now that you have created an account and are signed-in, go to the [Ollama Keys Settings](https://ollama.com/settings/keys) page.
Follow the directions on the page to determine where your Ollama Public Key is located.
![Ollama Key](images/ollama-keys.png)
Click on the `Add Ollama Public Key` button, and copy and paste the contents of your Ollama Public Key into the text field.
To push a model to [ollama.com](https://ollama.com), first make sure that it is named correctly with your username. You may have to use the `ollama cp` command to copy
your model to give it the correct name. Once you're happy with your model's name, use the `ollama push` command to push it to [ollama.com](https://ollama.com).
```shell
ollama cp mymodel myuser/mymodel
ollama push myuser/mymodel
```dockerfile
FROM /path/to/my/gemma/model
```
Once your model has been pushed, other users can pull and run it by using the command:
```shell
ollama run myuser/mymodel
$ ollama create mymodel
transferring model data
using autodetected template gemma-instruct
creating new layer sha256:baa2a0edc27d19cc6b7537578a9a7ba1a4e3214dc185ed5ae43692b319af7b84
creating new layer sha256:ba66c3309914dbef07e5149a648fd1877f030d337a4f240d444ea335008943cb
writing manifest
success
```
Defining a template in the Modelfile will disable this feature, which may be useful if you want to use a different template than the autodetected one.

View File

@@ -32,29 +32,4 @@ func TestCPUMemInfo(t *testing.T) {
}
}
func TestByLibrary(t *testing.T) {
type testCase struct {
input []GpuInfo
expect int
}
testCases := map[string]*testCase{
"empty": {input: []GpuInfo{}, expect: 0},
"cpu": {input: []GpuInfo{{Library: "cpu"}}, expect: 1},
"cpu + GPU": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda"}}, expect: 2},
"cpu + 2 GPU no variant": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda"}, {Library: "cuda"}}, expect: 2},
"cpu + 2 GPU same variant": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v11"}}, expect: 2},
"cpu + 2 GPU diff variant": {input: []GpuInfo{{Library: "cpu"}, {Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v12"}}, expect: 3},
}
for k, v := range testCases {
t.Run(k, func(t *testing.T) {
resp := (GpuInfoList)(v.input).ByLibrary()
if len(resp) != v.expect {
t.Fatalf("expected length %d, got %d => %+v", v.expect, len(resp), resp)
}
})
}
}
// TODO - add some logic to figure out card type through other means and actually verify we got back what we expected

View File

@@ -94,7 +94,7 @@ func (l GpuInfoList) ByLibrary() []GpuInfoList {
}
}
if !found {
libs = append(libs, requested)
libs = append(libs, info.Library)
resp = append(resp, []GpuInfo{info})
}
}

View File

@@ -1271,8 +1271,61 @@ struct llama_server_context
}
}
// for multiple images processing
bool ingest_images(server_slot &slot, int n_batch)
// 1 image only
bool prepare_pali(server_slot &slot, int n_batch)
{
int n_past = 0;
int image_idx = 0;
slot_image &img = slot.images[image_idx];
// rescale image embeddings
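// note: the Gemma build graph later scales inpL by sqrt(n_embd) (see the llama.cpp patch
// in this change), so dividing by sqrt(2048) here presumably pre-compensates that scaling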
float *data = img.image_embedding;
for (int i = 0; i < 2048 * 256; i++)
{
data[i] = data[i] / sqrt(2048);
}
set_image_embeds(ctx, data);
// generate user_prompt -> this should contain image tokens prepended and a new line appended:
// batch.n_tokens += (int)slot.images.size() * llama_n_embd(model);
std::vector<llama_token> tokens;
for (int i = 0; i < (int)slot.images.size() * 256; i++)
{
tokens.push_back(257152); // PaliGemma's <image> placeholder token id
}
tokens.push_back(2); // Gemma <bos> token
// move prefix prompt behind image tokens
for (int i = 0; i < batch.n_tokens; i++)
{
tokens.push_back(batch.token[i]);
}
llama_batch_clear(batch);
for (int i = 0; i < (int)tokens.size(); ++i)
{
llama_batch_add(batch, tokens[i], system_tokens.size() + slot.n_past, {slot.id}, true);
slot.n_past += 1;
}
// append prefix of next image
const auto json_prompt = slot.params.input_suffix;
std::vector<llama_token> append_tokens = tokenize(json_prompt, false); // has next image
append_tokens.push_back(108); // Gemma's newline ("\n") token
for (int i = 0; i < (int)append_tokens.size(); ++i)
{
llama_batch_add(batch, append_tokens[i], system_tokens.size() + slot.n_past, {slot.id}, true);
slot.n_past += 1;
}
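// PaliGemma attends non-causally over the image and prefix tokens, so disable causal
// attention here; it is switched back on right after llama_decode further down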
llama_set_causal_attn(ctx, false);
return true;
}
bool process_llava(server_slot &slot, int n_batch)
{
int image_idx = 0;
@@ -1349,6 +1402,21 @@ struct llama_server_context
return true;
}
// for multiple images processing based on model architecture
bool ingest_images(server_slot &slot, int n_batch)
{
switch (llama_get_architecture(model))
{
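// architecture ids mirror llama.cpp's internal llm_arch enum: 0 is the LLaMA family
// (LLaVA-style projectors); 25 is assumed here to be Gemma, the text backbone of PaliGemma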
case 0:
return process_llava(slot, n_batch);
case 25:
return prepare_pali(slot, n_batch);
default:
LOG_TEE("%s : failed to retrieve model architecture\n", __func__);
return false;
}
}
void request_cancel(int task_id)
{
task_server task;
@@ -1916,6 +1984,7 @@ struct llama_server_context
};
const int ret = llama_decode(ctx, batch_view);
llama_set_causal_attn(ctx, true); // restore causal attention after the batch has been decoded
if (ret != 0)
{

View File

@@ -7,7 +7,7 @@ index 1fe2b9f7..a43312a7 100644
// TODO: use a per-batch flag for logits presence instead
- const bool has_logits = !cparams.embeddings;
+ const bool has_logits = cparams.causal_attn;
+ const bool has_logits = cparams.causal_attn || lctx.image_embeds;
const bool has_embd = lctx.is_encoding || (cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE));
const size_t logits_size = has_logits ? n_vocab*n_outputs_max : 0;
@@ -36,7 +36,7 @@ index 1fe2b9f7..a43312a7 100644
GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor");
}
+
+ if (!cparams.causal_attn) {
+ if (!cparams.causal_attn && !has_image_embeds) {
+ res = nullptr; // do not extract logits when not needed
+ }
+

View File

@@ -0,0 +1,113 @@
diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 9c0d351e..019a147c 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -718,10 +718,12 @@ static ggml_cgraph * clip_image_build_graph(clip_ctx * ctx, const clip_image_f32
embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
- embeddings = ggml_gelu(ctx0, embeddings);
- embeddings = ggml_mul_mat(ctx0, model.mm_2_w, embeddings);
- embeddings = ggml_add(ctx0, embeddings, model.mm_2_b);
-
+ if (model.mm_2_w)
+ {
+ embeddings = ggml_gelu(ctx0, embeddings);
+ embeddings = ggml_mul_mat(ctx0, model.mm_2_w, embeddings);
+ embeddings = ggml_add(ctx0, embeddings, model.mm_2_b);
+ }
} else if (ctx->proj_type == PROJECTOR_TYPE_MLP_NORM) {
embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
@@ -2102,6 +2104,10 @@ int clip_n_mmproj_embd(const struct clip_ctx * ctx) {
return ctx->vision_model.mm_model_peg_0_b->ne[0];
}
if (ctx->proj_type == PROJECTOR_TYPE_MLP) {
+ if (ctx->vision_model.mm_2_b == nullptr)
+ {
+ return ctx->vision_model.mm_0_b->ne[0];
+ }
return ctx->vision_model.mm_2_b->ne[0];
}
if (ctx->proj_type == PROJECTOR_TYPE_MLP_NORM) {
diff --git a/include/llama.h b/include/llama.h
index 6072e76e..4c572a74 100644
--- a/include/llama.h
+++ b/include/llama.h
@@ -444,6 +444,12 @@ extern "C" {
// Frees all allocated memory
LLAMA_API void llama_free(struct llama_context * ctx);
+ // Sets image embeddings
+ LLAMA_API void set_image_embeds(struct llama_context *ctx, float *data);
+
+ // Get architecture
+ LLAMA_API int llama_get_architecture(struct llama_model *model);
+
LLAMA_API int64_t llama_time_us(void);
LLAMA_API size_t llama_max_devices(void);
diff --git a/src/llama.cpp b/src/llama.cpp
index d883ed19..322b4b59 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -2710,6 +2710,8 @@ struct llama_context {
bool logits_all = false;
+ float *image_embeds = nullptr;
+
// embeddings output (2-dimensional array: [n_outputs][n_embd])
// populated only when pooling_type == LLAMA_POOLING_TYPE_NONE
size_t embd_size = 0; // capacity (of floats) for embeddings
@@ -11591,6 +11593,15 @@ struct llm_build_context {
inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);
+ if (lctx.image_embeds)
+ {
+ struct ggml_tensor *image_embeds = ggml_dup_tensor(ctx0, inpL);
+ image_embeds->data = lctx.image_embeds;
+ image_embeds->ne[1] = 256;
+ inpL = ggml_set_2d_inplace(ctx0, inpL, image_embeds, inpL->nb[1], 0);
+ lctx.image_embeds = NULL;
+ }
+
inpL = ggml_scale(ctx0, inpL, sqrtf(n_embd));
cb(inpL, "inp_scaled", -1);
@@ -14468,6 +14479,7 @@ static int llama_decode_internal(
const int64_t n_embd = hparams.n_embd;
const int64_t n_vocab = hparams.n_vocab;
+ const bool has_image_embeds = lctx.image_embeds;
uint32_t n_outputs = 0;
uint32_t n_outputs_prev = 0;
@@ -14581,7 +14593,8 @@ static int llama_decode_internal(
}
// non-causal masks do not use the KV cache
- if (hparams.causal_attn) {
+ if (hparams.causal_attn || lctx.image_embeds)
+ {
llama_kv_cache_update(&lctx);
// if we have enough unused cells before the current head ->
@@ -16455,6 +16468,16 @@ void llama_free_model(struct llama_model * model) {
delete model;
}
+void set_image_embeds(llama_context *ctx, float *data)
+{
+ ctx->image_embeds = data;
+}
+
+int llama_get_architecture(llama_model *model)
+{
+ return model->arch;
+}
+
struct llama_context * llama_new_context_with_model(
struct llama_model * model,
struct llama_context_params params) {

View File

@@ -258,7 +258,7 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr
params = append(params, "--mlock")
}
if gpu.IsNUMA() && gpus[0].Library == "cpu" {
if gpu.IsNUMA() {
numaMode := "distribute"
if runtime.GOOS == "linux" {
if _, err := exec.LookPath("numactl"); err == nil {