From d47b985e5d734c86f7dacfee472fd01d882ffbed Mon Sep 17 00:00:00 2001
From: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Date: Wed, 8 Nov 2023 07:40:12 -0500
Subject: [PATCH] docs: update quantization notes (#589)

Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
---
 README.md | 60 +++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 45 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 048dc5ed..8a84388a 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ Key features include:
 🤖️ **Bring your own LLM**: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance for specific tasks. A unified fine-tuning API for models (`LLM.tuning()`) is coming soon.

-⚡ **Quantization**: Run inference with less computational and memory costs though quantization techniques like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPTQ](https://arxiv.org/abs/2210.17323).
+⚡ **Quantization**: Run inference with lower computational and memory costs using quantization techniques such as [LLM.int8](https://arxiv.org/abs/2208.07339), [SpQR (int4)](https://arxiv.org/abs/2306.03078), [AWQ](https://arxiv.org/pdf/2306.00978.pdf), [GPTQ](https://arxiv.org/abs/2210.17323), and [SqueezeLLM](https://arxiv.org/pdf/2306.07629v2.pdf).

 📡 **Streaming**: Support token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs.

@@ -210,7 +210,7 @@ You can specify any of the following Mistral models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -284,7 +284,7 @@ You can specify any of the following Llama models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -375,7 +375,7 @@ You can specify any of the following Dolly-v2 models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -426,7 +426,7 @@ You can specify any of the following Falcon models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -471,7 +471,7 @@ You can specify any of the following Flan-T5 models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -518,7 +518,7 @@ You can specify any of the following GPT-NeoX models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -572,7 +572,7 @@ You can specify any of the following MPT models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -625,7 +625,7 @@ You can specify any of the following OPT models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -675,7 +675,7 @@ You can specify any of the following StableLM models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -724,7 +724,7 @@ You can specify any of the following StarCoder models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -777,7 +777,7 @@ You can specify any of the following Baichuan models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -820,9 +820,20 @@ Note:
 Quantization is a technique to reduce the storage and computation requirements for machine learning models, particularly during inference. By approximating floating-point numbers as integers (quantized values), quantization allows for faster computations, reduced memory footprint, and can make it feasible to deploy large models on resource-constrained devices.

-OpenLLM supports quantization through two methods - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPTQ](https://arxiv.org/abs/2210.17323).
+OpenLLM supports the following quantization techniques:

-To run a model using the `bitsandbytes` method for quantization, you can use the following command:
+- [LLM.int8(): 8-bit Matrix Multiplication](https://arxiv.org/abs/2208.07339) through [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078) through [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- [AWQ: Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978)
+- [GPTQ: Accurate Post-Training Quantization](https://arxiv.org/abs/2210.17323)
+- [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629)
+
+### PyTorch backend
+
+With the PyTorch backend, OpenLLM supports `int8`, `int4`, and `gptq` quantization.
+
+To use `int8` or `int4` quantization through `bitsandbytes`, run the following command:

 ```bash
 openllm start opt --quantize int8
@@ -831,7 +842,7 @@ openllm start opt --quantize int8
 To run inference with `gptq`, simply pass `--quantize gptq`:

 ```bash
-openllm start falcon --model-id TheBloke/falcon-40b-instruct-GPTQ --quantize gptq --device 0
+openllm start llama --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
 ```

 > [!NOTE]
@@ -839,6 +850,25 @@ openllm start falcon --model-id TheBloke/falcon-40b-instruct-GPTQ --quantize gpt
 > first to install the dependency. From the GPTQ paper, it is recommended to quantize the weights before serving.
 > See [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) for more information on GPTQ quantization.
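+
+As a rough illustration of that offline quantization step, the sketch below uses the AutoGPTQ library directly (it is not part of OpenLLM itself). The base model ID, the single calibration sentence, and the output directory are placeholders, and a real calibration set should contain many representative samples:
+
+```python
+from transformers import AutoTokenizer
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+
+base_model_id = "facebook/opt-125m"   # placeholder base model
+quantized_dir = "opt-125m-gptq-4bit"  # placeholder output directory
+
+tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
+# Calibration data used to estimate the layer-wise quantization error (placeholder sample).
+examples = [tokenizer("OpenLLM serves open-source large language models in production.")]
+
+# 4-bit GPTQ with a group size of 128 is a common configuration.
+quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
+
+model = AutoGPTQForCausalLM.from_pretrained(base_model_id, quantize_config)
+model.quantize(examples)  # post-training quantization pass
+model.save_quantized(quantized_dir, use_safetensors=True)
+tokenizer.save_pretrained(quantized_dir)
+```
+
+The saved checkpoint can then be published (for example to the HuggingFace Hub) and served with `--quantize gptq`, as in the `TheBloke/Llama-2-7B-Chat-GPTQ` example above.
+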
+### vLLM backend
+
+With the vLLM backend, OpenLLM supports `awq` and `squeezellm` quantization.
+
+To run inference with `awq`, simply pass `--quantize awq`:
+
+```bash
+openllm start mistral --model-id TheBloke/zephyr-7B-alpha-AWQ --quantize awq
+```
+
+To run inference with `squeezellm`, simply pass `--quantize squeezellm`:
+
+```bash
+openllm start llama --model-id squeeze-ai-lab/sq-llama-2-7b-w4-s0 --quantize squeezellm --serialization legacy
+```
+
+> [!IMPORTANT]
+> Since both `awq` and `squeezellm` quantize the model weights ahead of time (post-training weight quantization), the weights need to be quantized before inference. Make sure to find compatible pre-quantized weights on the HuggingFace Hub for your model of choice.
+
 ## 🛠️ Serving fine-tuning layers

 [PEFT](https://huggingface.co/docs/peft/index), or Parameter-Efficient Fine-Tuning, is a methodology designed to fine-tune pre-trained models more efficiently. Instead of adjusting all model parameters, PEFT focuses on tuning only a subset, reducing computational and storage costs. [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) (Low-Rank Adaptation) is one of the techniques supported by PEFT. It streamlines fine-tuning by using low-rank decomposition to represent weight updates, thereby drastically reducing the number of trainable parameters.
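+
+To make the LoRA idea above concrete, the sketch below uses the `peft` and `transformers` libraries directly rather than the OpenLLM CLI; the base model ID, rank, and target modules are illustrative placeholders, not values prescribed by OpenLLM:
+
+```python
+from transformers import AutoModelForCausalLM
+from peft import LoraConfig, get_peft_model
+
+base_model_id = "facebook/opt-1.3b"  # placeholder base model
+
+model = AutoModelForCausalLM.from_pretrained(base_model_id)
+
+# LoRA represents each weight update as a low-rank product B @ A of rank r,
+# so only the small A and B matrices are trained while the base weights stay frozen.
+lora_config = LoraConfig(
+    r=8,
+    lora_alpha=16,
+    lora_dropout=0.05,
+    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
+    task_type="CAUSAL_LM",
+)
+
+peft_model = get_peft_model(model, lora_config)
+peft_model.print_trainable_parameters()  # typically well under 1% of all parameters
+```
+
+After fine-tuning, only the small adapter matrices need to be saved and shipped alongside the frozen base model, which is what makes serving LoRA layers on top of a single base model practical.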