From d47b985e5d734c86f7dacfee472fd01d882ffbed Mon Sep 17 00:00:00 2001
From: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Date: Wed, 8 Nov 2023 07:40:12 -0500
Subject: [PATCH] docs: update quantization notes (#589)

Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
---
 README.md | 60 +++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 45 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 048dc5ed..8a84388a 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ Key features include:
 🤖️ **Bring your own LLM**: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance for specific tasks. A unified fine-tuning API for models (`LLM.tuning()`) is coming soon.

-⚡ **Quantization**: Run inference with less computational and memory costs though quantization techniques like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPTQ](https://arxiv.org/abs/2210.17323).
+⚡ **Quantization**: Run inference with lower computational and memory costs using quantization techniques such as [LLM.int8](https://arxiv.org/abs/2208.07339), [SpQR (int4)](https://arxiv.org/abs/2306.03078), [AWQ](https://arxiv.org/pdf/2306.00978.pdf), [GPTQ](https://arxiv.org/abs/2210.17323), and [SqueezeLLM](https://arxiv.org/pdf/2306.07629v2.pdf).

 📡 **Streaming**: Support token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs.

@@ -210,7 +210,7 @@ You can specify any of the following Mistral models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -284,7 +284,7 @@ You can specify any of the following Llama models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -375,7 +375,7 @@ You can specify any of the following Dolly-v2 models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -426,7 +426,7 @@ You can specify any of the following Falcon models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -471,7 +471,7 @@ You can specify any of the following Flan-T5 models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -518,7 +518,7 @@ You can specify any of the following GPT-NeoX models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -572,7 +572,7 @@ You can specify any of the following MPT models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -625,7 +625,7 @@ You can specify any of the following OPT models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -675,7 +675,7 @@ You can specify any of the following StableLM models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -724,7 +724,7 @@ You can specify any of the following StarCoder models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -777,7 +777,7 @@ You can specify any of the following Baichuan models by using `--model-id`.
 ```

 > [!NOTE]
-> Currently when using the vLLM backend, quantization and adapters are not supported.
+> Currently when using the vLLM backend, adapters are not yet supported.

@@ -820,9 +820,20 @@ Note:
 Quantization is a technique to reduce the storage and computation requirements for machine learning models, particularly during inference. By approximating floating-point numbers as integers (quantized values), quantization allows for faster computations, reduced memory footprint, and can make it feasible to deploy large models on resource-constrained devices.

-OpenLLM supports quantization through two methods - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPTQ](https://arxiv.org/abs/2210.17323).
+OpenLLM supports the following quantization techniques:

-To run a model using the `bitsandbytes` method for quantization, you can use the following command:
+- [LLM.int8(): 8-bit Matrix Multiplication](https://arxiv.org/abs/2208.07339) through [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078) through [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- [AWQ: Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978)
+- [GPTQ: Accurate Post-Training Quantization](https://arxiv.org/abs/2210.17323)
+- [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629)
+
+### PyTorch backend
+
+With the PyTorch backend, OpenLLM supports `int8`, `int4`, and `gptq` quantization.
+
+To use `int8` or `int4` quantization through `bitsandbytes`, run the following command:

 ```bash
 openllm start opt --quantize int8
@@ -831,7 +842,7 @@ openllm start opt --quantize int8
 To run inference with `gptq`, simply pass `--quantize gptq`:

 ```bash
-openllm start falcon --model-id TheBloke/falcon-40b-instruct-GPTQ --quantize gptq --device 0
+openllm start llama --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
 ```

 > [!NOTE]
@@ -839,6 +850,25 @@ openllm start falcon --model-id TheBloke/falcon-40b-instruct-GPTQ --quantize gpt
 > first to install the dependency. From the GPTQ paper, it is recommended to quantize the weights before serving.
 > See [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) for more information on GPTQ quantization.
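+
+As a rough illustration of that offline quantization step, the sketch below uses the AutoGPTQ library directly (it is not part of OpenLLM itself). The base model ID, the single calibration sentence, and the output directory are placeholders, and a real calibration set should contain many representative samples:
+
+```python
+from transformers import AutoTokenizer
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+
+base_model_id = "facebook/opt-125m"   # placeholder base model
+quantized_dir = "opt-125m-gptq-4bit"  # placeholder output directory
+
+tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
+# Calibration data used to estimate the layer-wise quantization error (placeholder sample).
+examples = [tokenizer("OpenLLM serves open-source large language models in production.")]
+
+# 4-bit GPTQ with a group size of 128 is a common configuration.
+quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
+
+model = AutoGPTQForCausalLM.from_pretrained(base_model_id, quantize_config)
+model.quantize(examples)  # post-training quantization pass
+model.save_quantized(quantized_dir, use_safetensors=True)
+tokenizer.save_pretrained(quantized_dir)
+```
+
+The saved checkpoint can then be published (for example to the HuggingFace Hub) and served with `--quantize gptq`, as in the `TheBloke/Llama-2-7B-Chat-GPTQ` example above.
+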
+### vLLM backend
+
+With the vLLM backend, OpenLLM supports `awq` and `squeezellm` quantization.
+
+To run inference with `awq`, simply pass `--quantize awq`:
+
+```bash
+openllm start mistral --model-id TheBloke/zephyr-7B-alpha-AWQ --quantize awq
+```
+
+To run inference with `squeezellm`, simply pass `--quantize squeezellm`:
+
+```bash
+openllm start llama --model-id squeeze-ai-lab/sq-llama-2-7b-w4-s0 --quantize squeezellm --serialization legacy
+```
+
+> [!IMPORTANT]
+> Since both `awq` and `squeezellm` quantize the model weights ahead of time (post-training weight quantization), the weights need to be quantized before inference. Make sure to find compatible pre-quantized weights on the HuggingFace Hub for your model of choice.
+
 ## 🛠️ Serving fine-tuning layers

 [PEFT](https://huggingface.co/docs/peft/index), or Parameter-Efficient Fine-Tuning, is a methodology designed to fine-tune pre-trained models more efficiently. Instead of adjusting all model parameters, PEFT focuses on tuning only a subset, reducing computational and storage costs. [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) (Low-Rank Adaptation) is one of the techniques supported by PEFT. It streamlines fine-tuning by using low-rank decomposition to represent weight updates, thereby drastically reducing the number of trainable parameters.
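+
+To make the LoRA idea above concrete, the sketch below uses the `peft` and `transformers` libraries directly rather than the OpenLLM CLI; the base model ID, rank, and target modules are illustrative placeholders, not values prescribed by OpenLLM:
+
+```python
+from transformers import AutoModelForCausalLM
+from peft import LoraConfig, get_peft_model
+
+base_model_id = "facebook/opt-1.3b"  # placeholder base model
+
+model = AutoModelForCausalLM.from_pretrained(base_model_id)
+
+# LoRA represents each weight update as a low-rank product B @ A of rank r,
+# so only the small A and B matrices are trained while the base weights stay frozen.
+lora_config = LoraConfig(
+    r=8,
+    lora_alpha=16,
+    lora_dropout=0.05,
+    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
+    task_type="CAUSAL_LM",
+)
+
+peft_model = get_peft_model(model, lora_config)
+peft_model.print_trainable_parameters()  # typically well under 1% of all parameters
+```
+
+After fine-tuning, only the small adapter matrices need to be saved and shipped alongside the frozen base model, which is what makes serving LoRA layers on top of a single base model practical.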