Mirror of https://github.com/bentoml/OpenLLM.git (synced 2026-01-21 22:10:45 -05:00)

chore(docs): update to obsidian README format

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

README.md: 29 changes
@@ -135,7 +135,8 @@ specify different variants of the model to be served, by providing the
openllm start flan-t5 --model-id google/flan-t5-large
```

> **Note** that `openllm` also supports all variants of fine-tuning weights,
> [!NOTE]
> `openllm` also supports all variants of fine-tuning weights,
> custom model path as well as quantized weights for any of the supported models
> as long as it can be loaded with the model architecture. Refer to
> [supported models](https://github.com/bentoml/OpenLLM/tree/main#-supported-models)

@@ -417,7 +418,8 @@ For example, if you want to use the Tensorflow (`tf`) implementation for the
OPENLLM_FLAN_T5_FRAMEWORK=tf openllm start flan-t5
```

> **Note** For GPU support on Flax, refer to
> [!NOTE]
> For GPU support on Flax, refer to
> [Jax's installation](https://github.com/google/jax#pip-installation-gpu-cuda-installed-via-pip-easier)
> to make sure that you have Jax support for the corresponding CUDA version.

@@ -437,7 +439,8 @@ To run inference with `gptq`, simply pass `--quantize gptq`:
openllm start falcon --model-id TheBloke/falcon-40b-instruct-GPTQ --quantize gptq --device 0
```

> **Note**: to run GPTQ, make sure to install with
> [!NOTE]
> In order to run GPTQ, make sure to install with
> `pip install "openllm[gptq]"`. The weights of all supported models should be
> quantized before serving. See
> [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for more

@@ -482,7 +485,8 @@ To include this into the Bento, one can also provide a `--adapter-id` into
openllm build opt --model-id facebook/opt-6.7b --adapter-id ...
```

> **Note**: We will gradually roll out support for fine-tuning all models. The
> [!NOTE]
> We will gradually roll out support for fine-tuning all models. The
> following models contain fine-tuning support: OPT, Falcon, LLaMA.

### Integrating a New Model

@@ -527,8 +531,8 @@ client = openllm.client.HTTPClient("http://localhost:3000")
client.embed("I like to eat apples")
```

> **Note**: Currently, the following model families support embeddings: Llama,
> T5 (Flan-T5, FastChat, etc.), ChatGLM
> [!NOTE]
> Currently, the following model families support embeddings: Llama, T5 (Flan-T5, FastChat, etc.), ChatGLM

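For orientation, the embeddings call shown in this hunk is used roughly as follows. This is a minimal sketch only: the server URL and prompt are illustrative, and the exact structure of the returned value depends on the OpenLLM client version.

```python
# Minimal sketch of the embeddings API shown in the hunk above (illustrative values only).
import openllm

# Assumes a server started with e.g. `openllm start flan-t5` on the default port.
client = openllm.client.HTTPClient("http://localhost:3000")
result = client.embed("I like to eat apples")  # embedding for the prompt; shape depends on the model
print(result)
```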
## ⚙️ Integrations

@@ -606,7 +610,8 @@ def chat(input_text: str):
return agent.run(input_text)
```

> **Note** You can find out more examples under the
> [!NOTE]
> You can find out more examples under the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.

### Transformers Agents

@@ -614,7 +619,8 @@ def chat(input_text: str):
OpenLLM seamlessly integrates with
[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).

> **Warning** The Transformers Agent is still at an experimental stage. It is
> [!WARNING]
> The Transformers Agent is still at an experimental stage. It is
> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
> to get the latest API update for HuggingFace agent.

@@ -626,7 +632,8 @@ agent = transformers.HfAgent("http://localhost:3000/hf/agent") # URL that runs
agent.run("Is the following `text` positive or negative?", text="I don't like how this models is generate inputs")
```

> **Note** Only `starcoder` is currently supported with Agent integration. The
> [!IMPORTANT]
> Only `starcoder` is currently supported with Agent integration. The
> example above was also run with four T4s on EC2 `g4dn.12xlarge`

If you want to use OpenLLM client to ask questions to the running agent, you can

@@ -646,6 +653,7 @@ client.ask_agent(
<!-- hatch-fancy-pypi-readme interim stop -->


<br/>

<!-- hatch-fancy-pypi-readme meta start -->

@@ -691,7 +699,8 @@ serverless cloud for shipping and scaling AI applications.
bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>
```

> **Note**: Replace `<your-api-token>` and `<bento-cloud-endpoint>` with your
> [!NOTE]
> Replace `<your-api-token>` and `<bento-cloud-endpoint>` with your
> specific API token and the BentoCloud endpoint respectively.

3. **Building a Bento**: With OpenLLM, you can easily build a Bento for a

@@ -11,7 +11,7 @@ text = """

"""
[[metadata.hooks.fancy-pypi-readme.fragments]]
end-before = "\n<!-- hatch-fancy-pypi-readme intro stop -->\n"
end-before = "\n<!-- hatch-fancy-pypi-readme intro stop -->"
path = "README.md"
start-after = "<!-- hatch-fancy-pypi-readme intro start -->\n"
[[metadata.hooks.fancy-pypi-readme.fragments]]

@@ -22,11 +22,12 @@ text = """
</p>
"""
[[metadata.hooks.fancy-pypi-readme.fragments]]
end-before = "\n<!-- hatch-fancy-pypi-readme interim stop -->\n"
end-before = "\n<!-- hatch-fancy-pypi-readme interim stop -->"
path = "README.md"
start-after = "<!-- hatch-fancy-pypi-readme interim start -->\n"
[[metadata.hooks.fancy-pypi-readme.fragments]]
text = """

<p align="center">
<img src="https://raw.githubusercontent.com/bentoml/openllm/main/assets/agent.gif" alt="Gif showing Agent integration" />
</p>

@@ -43,6 +44,7 @@ text = """
"""
[[tool.hatch.metadata.hooks.fancy-pypi-readme.fragments]]
path = "CHANGELOG.md"
start-after = "<!-- towncrier release notes start -->"
pattern = "\n(###.+?\n)## "
[[metadata.hooks.fancy-pypi-readme.fragments]]
text = """

@@ -1423,8 +1423,8 @@ class LLMConfig(_ConfigAttr):

This can be used as a decorator for click commands.

> **Note**: that the identifier for all LLMConfig will be prefixed with '<model_name>_*', and the generation config
will be prefixed with '<model_name>_generation_*'.
> [!NOTE]
> The identifier for all LLMConfig will be prefixed with '<model_name>_*', and the generation config will be prefixed with '<model_name>_generation_*'.
"""
for name, field in attr.fields_dict(cls.__openllm_generation_class__).items():
ty = cls.__openllm_hints__.get(name)

@@ -223,7 +223,8 @@ class LLMInterface(ABC, t.Generic[M, T]):

You can customize how the output of the LLM looks with this hook. By default, it is a simple echo.

NOTE: this will be used from the client side.
> [!NOTE]
> This will be used from the client side.
"""
return generation_result
def llm_post_init(self) -> None:

@@ -271,7 +272,7 @@ class LLMInterface(ABC, t.Generic[M, T]):
- If `self.bettertransformer` is set within `llm_post_init`.
- Finally, if none of the above, default to self.config['bettertransformer']

> **Note** that if LoRA is enabled, bettertransformer will be disabled.
> [!NOTE] that if LoRA is enabled, bettertransformer will be disabled.
"""
device: "torch.device"
"""The device to be used for this LLM. If the implementation is 'pt', then it will be torch.device, else string."""

@@ -562,7 +563,7 @@ class LLM(LLMInterface[M, T], ReprMixin):

Args:
model_id: The pretrained model to use. Defaults to None. If None, 'self.default_id' will be used.
> **Warning**: If custom path is passed, make sure it contains all available files to construct
> [!WARNING] If custom path is passed, make sure it contains all available files to construct
> ``transformers.PretrainedConfig``, ``transformers.PreTrainedModel``, and ``transformers.PreTrainedTokenizer``.
model_name: Optional model name to be saved with this LLM. Default to None. It will be inferred automatically from model_id.
If model_id is a custom path, it will be the basename of the given path.

@@ -629,7 +630,7 @@ class LLM(LLMInterface[M, T], ReprMixin):
If model_id contains the revision itself, then the same format above
If model_id is a path, then it will be <framework>-<basename_of_path>:<generated_sha1> if model_version is not passed, otherwise <framework>-<basename_of_path>:<model_version>

**Note** here that the generated SHA1 for path cases is that it will be based on last modified time.
> [!NOTE] here that the generated SHA1 for path cases is that it will be based on last modified time.

Args:
model_id: Model id for this given LLM. It can be pretrained weights URL, custom path.

@@ -664,14 +665,14 @@ class LLM(LLMInterface[M, T], ReprMixin):
):
"""Initialize the LLM with given pretrained model.

> **Warning**
> [!WARNING]
> To initialize any LLM, you should use `openllm.AutoLLM` or `openllm.LLM.from_pretrained` instead.
> `__init__` initialization is only for internal use.

Note:
- *args to be passed to the model.
- **attrs will first be parsed to the AutoConfig, then the rest will be parsed to the import_model
- for tokenizer kwargs, it should be prefixed with _tokenizer_*
> [!NOTE]
> - *args to be passed to the model.
> - **attrs will first be parsed to the AutoConfig, then the rest will be parsed to the import_model
> - for tokenizer kwargs, it should be prefixed with _tokenizer_*

For custom pretrained path, it is recommended to pass in 'model_version' alongside with the path
to ensure that it won't be loaded multiple times.

@@ -925,10 +926,11 @@ class LLM(LLMInterface[M, T], ReprMixin):
Returns:
A generated LLMRunner for this LLM.

> **Note**: There are some differences between bentoml.models.get().to_runner() and LLM.to_runner(): 'name'.
- 'name': will be generated by OpenLLM, hence users shouldn't worry about this. The generated name will be 'llm-<model-start-name>-runner' (ex: llm-dolly-v2-runner, llm-chatglm-runner)
- 'embedded': Will be disabled by default. There is no reason to run LLM in embedded mode.
- 'method_configs': The method configs for the runner will be managed internally by OpenLLM.
> [!NOTE]: There are some differences between bentoml.models.get().to_runner() and LLM.to_runner():
>
> - 'name': will be generated by OpenLLM, hence users shouldn't worry about this. The generated name will be 'llm-<model-start-name>-runner' (ex: llm-dolly-v2-runner, llm-chatglm-runner)
> - 'embedded': Will be disabled by default. There is no reason to run LLM in embedded mode.
> - 'method_configs': The method configs for the runner will be managed internally by OpenLLM.
"""
models = models if models is not None else []

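As a quick illustration of the naming behaviour documented in this docstring (a sketch only; `openllm.Runner` appears later in this diff, and the model name is illustrative):

```python
# Sketch of the runner naming rule described above: 'llm-<model-start-name>-runner'.
import openllm

runner = openllm.Runner("dolly-v2")  # resolves and loads the model on first use
print(runner.name)                   # expected to be "llm-dolly-v2-runner" per the docstring above
```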
@@ -946,7 +948,8 @@ class LLM(LLMInterface[M, T], ReprMixin):
# NOTE: returning the two langchain API's to the runner
return llm_runner_class(self)(
llm_runnable_class(self, embeddings_sig, generate_sig, generate_iterator_sig), name=self.runner_name, embedded=False, models=models, max_batch_size=max_batch_size, max_latency_ms=max_latency_ms,
method_configs=bentoml_cattr.unstructure({"embeddings": embeddings_sig, "__call__": generate_sig, "generate": generate_sig, "generate_one": generate_sig, "generate_iterator": generate_iterator_sig}), scheduling_strategy=scheduling_strategy,
method_configs=bentoml_cattr.unstructure({"embeddings": embeddings_sig, "__call__": generate_sig, "generate": generate_sig, "generate_one": generate_sig, "generate_iterator": generate_iterator_sig}),
scheduling_strategy=scheduling_strategy,
)

# NOTE: Scikit API

@@ -972,16 +975,11 @@ class LLM(LLMInterface[M, T], ReprMixin):
@overload
def Runner(model_name: str, *, model_id: str | None = None, model_version: str | None = ..., init_local: t.Literal[False, True] = ..., **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
@overload
def Runner(
model_name: str, *, model_id: str = ..., model_version: str | None = ..., models: list[bentoml.Model] | None = ..., max_batch_size: int | None = ..., max_latency_ms: int | None = ..., method_configs: dict[str, ModelSignatureDict | ModelSignature] | None = ..., embedded: t.Literal[True, False] = ..., scheduling_strategy: type[bentoml.Strategy] | None = ..., **attrs: t.Any
) -> LLMRunner[t.Any, t.Any]: ...
def Runner(model_name: str, *, model_id: str = ..., model_version: str | None = ..., models: list[bentoml.Model] | None = ..., max_batch_size: int | None = ..., max_latency_ms: int | None = ..., method_configs: dict[str, ModelSignatureDict | ModelSignature] | None = ..., embedded: t.Literal[True, False] = ..., scheduling_strategy: type[bentoml.Strategy] | None = ..., **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
@overload
def Runner(model_name: str, *, ensure_available: bool | None = None, init_local: bool = ..., implementation: LiteralRuntime | None = None, llm_config: LLMConfig | None = None, **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
@overload
def Runner(
model_name: str, *, model_id: str | None = ..., model_version: str | None = ..., llm_config: LLMConfig | None = ..., runtime: t.Literal["ggml", "transformers"] | None = ..., quantize: t.Literal["int8", "int4", "gptq"] | None = ..., bettertransformer: str | bool | None = ..., adapter_id: str | None = ..., adapter_name: str | None = ...,
adapter_map: dict[str, str | None] | None = ..., quantization_config: transformers.BitsAndBytesConfig | autogptq.BaseQuantizeConfig | None = None, serialisation: t.Literal["safetensors", "legacy"] = ..., **attrs: t.Any
) -> LLMRunner[t.Any, t.Any]: ...
def Runner(model_name: str, *, model_id: str | None = ..., model_version: str | None = ..., llm_config: LLMConfig | None = ..., runtime: t.Literal["ggml", "transformers"] | None = ..., quantize: t.Literal["int8", "int4", "gptq"] | None = ..., bettertransformer: str | bool | None = ..., adapter_id: str | None = ..., adapter_name: str | None = ..., adapter_map: dict[str, str | None] | None = ..., quantization_config: transformers.BitsAndBytesConfig | autogptq.BaseQuantizeConfig | None = None, serialisation: t.Literal["safetensors", "legacy"] = ..., **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
# fmt: on

def Runner(model_name: str, ensure_available: bool | None = None, init_local: bool = False, implementation: LiteralRuntime | None = None, llm_config: LLMConfig | None = None, **attrs: t.Any) -> LLMRunner[t.Any, t.Any]:

@@ -1017,7 +1015,8 @@ def Runner(model_name: str, ensure_available: bool | None = None, init_local: bo
behaviour
"""
if llm_config is not None:
attrs.update({"model_id": llm_config["env"]["model_id_value"], "bettertransformer": llm_config["env"]["bettertransformer_value"], "quantize": llm_config["env"]["quantize_value"], "runtime": llm_config["env"]["runtime_value"], "serialisation": first_not_none(os.getenv("OPENLLM_SERIALIZATION"), attrs.get("serialisation"), default="safetensors"),})
attrs.update({"model_id": llm_config["env"]["model_id_value"], "bettertransformer": llm_config["env"]["bettertransformer_value"], "quantize": llm_config["env"]["quantize_value"], "runtime": llm_config["env"]["runtime_value"],
"serialisation": first_not_none(os.getenv("OPENLLM_SERIALIZATION"), attrs.get("serialisation"), default="safetensors")})

default_implementation = llm_config.default_implementation() if llm_config is not None else "pt"
implementation = first_not_none(implementation, default=EnvVarMixin(model_name, default_implementation)["framework_value"])

@@ -373,12 +373,12 @@ def quantize_option(f: _AnyCallable | None = None, *, build: bool = False, model

- ``gptq``: ``GPTQ`` [quantization](https://arxiv.org/abs/2210.17323)

**Note** that the model can also be served with quantized weights.
> [!NOTE] that the model can also be served with quantized weights.
""" + (
"""
**Note** that this will set the mode for serving within deployment.""" if build else ""
> [!NOTE] that this will set the mode for serving within deployment.""" if build else ""
) + """
**Note** that quantization is currently only available in *PyTorch* models.""", **attrs)(f)
> [!NOTE] that quantization is currently only available in *PyTorch* models.""", **attrs)(f)

def workers_per_resource_option(f: _AnyCallable | None = None, *, build: bool = False, **attrs: t.Any) -> t.Callable[[FC], FC]:
return cli_option(

@@ -387,16 +387,16 @@ def workers_per_resource_option(f: _AnyCallable | None = None, *, build: bool =
See https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy
for more information. By default, this is set to 1.

**Note**: ``--workers-per-resource`` will also accept the following strategies:
> [!NOTE] ``--workers-per-resource`` will also accept the following strategies:

- ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.

- ``conserved``: This will determine the number of available GPU resources, and only assign one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is equivalent to ``--workers-per-resource 0.25``.
""" + (
"""\n
**Note**: The workers value passed into 'build' will determine how the LLM can
be provisioned in Kubernetes as well as in standalone container. This will
ensure it has the same effect with 'openllm start --workers ...'""" if build else ""
> [!NOTE] The workers value passed into 'build' will determine how the LLM can
> be provisioned in Kubernetes as well as in standalone container. This will
> ensure it has the same effect with 'openllm start --workers ...'""" if build else ""
), **attrs)(f)

def bettertransformer_option(f: _AnyCallable | None = None, *, build: bool = False, model_env: EnvVarMixin | None = None, **attrs: t.Any) -> t.Callable[[FC], FC]:

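The ``conserved`` strategy described in this help text maps to a fractional ``--workers-per-resource`` value. A small illustrative sketch of the arithmetic from the docstring (not OpenLLM internals):

```python
# 'conserved' assigns a single worker across all visible GPUs, so the per-GPU
# fraction is 1 / <number of GPUs>: with 4 GPUs this matches --workers-per-resource 0.25.
def conserved_workers_per_resource(num_gpus: int) -> float:
    return 1 / num_gpus

assert conserved_workers_per_resource(4) == 0.25
```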
@@ -416,13 +416,13 @@ def serialisation_option(f: _AnyCallable | None = None, **attrs: t.Any) -> t.Cal
``safe_serialization=True``.

\b
**Note** that this format might not work for every case, and
> [!NOTE] that this format might not work for every case, and
you can always fallback to ``legacy`` if needed.

- ``legacy``: This will use PyTorch serialisation format, often as ``.bin`` files.
This should be used if the model doesn't yet support safetensors.

**Note** that GGML format is a work in progress.
> [!NOTE] that GGML format is a work in progress.
""", **attrs)(f)

def container_registry_option(f: _AnyCallable | None = None, **attrs: t.Any) -> t.Callable[[FC], FC]:

@@ -432,7 +432,7 @@ def container_registry_option(f: _AnyCallable | None = None, **attrs: t.Any) ->
Currently, it supports 'ecr', 'ghcr.io', 'docker.io'

\b
**Note** that in order to build the base image, you will need a GPU to compile custom kernels. See ``openllm ext build-base-container`` for more information.
> [!NOTE] that in order to build the base image, you will need a GPU to compile custom kernels. See ``openllm ext build-base-container`` for more information.
""")(f)

_wpr_strategies = {"round_robin", "conserved"}

@@ -332,17 +332,17 @@ class OpenLLMCommandGroup(BentoMLCommandGroup):
def group(self, name: _AnyCallable) -> click.Group:
...

# variant: name omitted, cls _must_ be a keyword argument, @group(cmd=GroupCls, ...)
@overload
def group(self, name: None = None, *, cls: t.Type[GrpType], **attrs: t.Any) -> t.Callable[[_AnyCallable], GrpType]:
...

# variant: with positional name and with positional or keyword cls argument:
# @group(namearg, GroupCls, ...) or @group(namearg, cls=GroupCls, ...)
@overload
def group(self, name: str | None, cls: type[GrpType], **attrs: t.Any) -> t.Callable[[_AnyCallable], GrpType]:
...

# variant: name omitted, cls _must_ be a keyword argument, @group(cmd=GroupCls, ...)
@overload
def group(self, name: None = None, *, cls: t.Type[GrpType], **attrs: t.Any) -> t.Callable[[_AnyCallable], GrpType]:
...

# variant: with optional string name, no cls argument provided.
@overload
def group(self, name: str | None = ..., cls: None = None, **attrs: t.Any) -> t.Callable[[_AnyCallable], click.Group]:

@@ -451,7 +451,7 @@ def import_command(model_name: str, model_id: str | None, converter: str | None,
$ CONVERTER=llama2-hf openllm import llama /path/to/llama-2
```

> **Note**: This behaviour will override ``--runtime``. Therefore make sure that the LLM contains correct conversion strategies to both GGML and HF.
> [!WARNING] This behaviour will override ``--runtime``. Therefore make sure that the LLM contains correct conversion strategies to both GGML and HF.
"""
llm_config = AutoConfig.for_model(model_name)
env = EnvVarMixin(model_name, llm_config.default_implementation(), model_id=model_id, runtime=runtime, quantize=quantize)

@@ -484,41 +484,39 @@ def _start(
For all additional arguments, pass it as string to ``additional_args``. For example, if you want to
pass ``--port 5001``, you can pass ``additional_args=["--port", "5001"]``

> **Note**: This will create a blocking process, so if you use this API, you can create a running sub thread
> [!NOTE] This will create a blocking process, so if you use this API, you can create a running sub thread
> to start the server instead of blocking the main thread.

``openllm.start`` will invoke ``click.Command`` under the hood, so it behaves exactly the same as the CLI interaction.

> **Note**: ``quantize`` and ``bettertransformer`` are mutually exclusive.
> [!NOTE] ``quantize`` and ``bettertransformer`` are mutually exclusive.

Args:
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
timeout: The server timeout
workers_per_resource: Number of workers per resource assigned.
See https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy
for more information. By default, this is set to 1.
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
timeout: The server timeout
workers_per_resource: Number of workers per resource assigned.
See [resource scheduling](https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy)
for more information. By default, this is set to 1.

> **Note**: ``--workers-per-resource`` will also accept the following strategies:

> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.

> - ``conserved``: This will determine the number of available GPU resources, and only assign
one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
equivalent to ``--workers-per-resource 0.25``.
device: Assign GPU devices (if available) to this LLM. By default, this is set to ``None``. It also accepts 'all'
argument to assign all available GPUs to this LLM.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
fast: Enable fast mode. This will skip downloading models, and will raise errors if given model_id does not exist under local store.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
framework: The framework to use for this LLM. By default, this is set to ``pt``.
additional_args: Additional arguments to pass to ``openllm start``.
> [!NOTE] ``--workers-per-resource`` will also accept the following strategies:
> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.
> - ``conserved``: This will determine the number of available GPU resources, and only assign
> one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
> equivalent to ``--workers-per-resource 0.25``.
device: Assign GPU devices (if available) to this LLM. By default, this is set to ``None``. It also accepts 'all'
argument to assign all available GPUs to this LLM.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
fast: Enable fast mode. This will skip downloading models, and will raise errors if given model_id does not exist under local store.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
framework: The framework to use for this LLM. By default, this is set to ``pt``.
additional_args: Additional arguments to pass to ``openllm start``.
"""
fast = os.getenv("OPENLLM_FAST", str(fast)).upper() in ENV_VARS_TRUE_VALUES
llm_config = AutoConfig.for_model(model_name)

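A minimal sketch of the Python API this docstring covers, following its own advice to run the blocking call in a sub-thread. The model name and port are illustrative, not prescribed by the commit:

```python
# Sketch of openllm.start as described above: it blocks, so run it in a thread.
import threading

import openllm

server = threading.Thread(
    target=openllm.start,
    args=("flan-t5",),
    kwargs={"additional_args": ["--port", "5001"]},  # extra CLI flags passed through, per the docstring
    daemon=True,
)
server.start()
```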
@@ -554,48 +552,45 @@ def _build(

``openllm.build`` will invoke ``click.Command`` under the hood, so it behaves exactly the same as ``openllm build`` CLI.

> **Note**: ``quantize`` and ``bettertransformer`` are mutually exclusive.
> [!NOTE] ``quantize`` and ``bettertransformer`` are mutually exclusive.

Args:
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
build_ctx: The build context to use for building BentoLLM. By default, it sets to current directory.
enable_features: Additional OpenLLM features to be included with this BentoLLM.
workers_per_resource: Number of workers per resource assigned.
See https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy
for more information. By default, this is set to 1.
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
build_ctx: The build context to use for building BentoLLM. By default, it sets to current directory.
enable_features: Additional OpenLLM features to be included with this BentoLLM.
workers_per_resource: Number of workers per resource assigned.
See [resource scheduling](https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy)
for more information. By default, this is set to 1.

> **Note**: ``--workers-per-resource`` will also accept the following strategies:

> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.

> - ``conserved``: This will determine the number of available GPU resources, and only assign
one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
equivalent to ``--workers-per-resource 0.25``.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
dockerfile_template: The dockerfile template to use for building BentoLLM. See
https://docs.bentoml.com/en/latest/guides/containerization.html#dockerfile-template.
overwrite: Whether to overwrite the existing BentoLLM. By default, this is set to ``False``.
push: Whether to push the result bento to BentoCloud. Make sure to login with 'bentoml cloud login' first.
containerize: Whether to containerize the Bento after building. '--containerize' is the shortcut of 'openllm build && bentoml containerize'.
Note that 'containerize' and 'push' are mutually exclusive
container_registry: Container registry to choose the base OpenLLM container image to build from. Default to ECR.
container_version_strategy: The container version strategy. Default to the latest release of OpenLLM.
serialisation_format: Serialisation for saving models. Default to 'safetensors', which is equivalent to `safe_serialization=True`
additional_args: Additional arguments to pass to ``openllm build``.
bento_store: Optional BentoStore for saving this BentoLLM. Default to the default BentoML local store.
> [!NOTE] ``--workers-per-resource`` will also accept the following strategies:
> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.
> - ``conserved``: This will determine the number of available GPU resources, and only assign
> one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
> equivalent to ``--workers-per-resource 0.25``.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
dockerfile_template: The dockerfile template to use for building BentoLLM. See https://docs.bentoml.com/en/latest/guides/containerization.html#dockerfile-template.
overwrite: Whether to overwrite the existing BentoLLM. By default, this is set to ``False``.
push: Whether to push the result bento to BentoCloud. Make sure to login with 'bentoml cloud login' first.
containerize: Whether to containerize the Bento after building. '--containerize' is the shortcut of 'openllm build && bentoml containerize'.
Note that 'containerize' and 'push' are mutually exclusive
container_registry: Container registry to choose the base OpenLLM container image to build from. Default to ECR.
container_registry: Container registry to choose the base OpenLLM container image to build from. Default to ECR.
container_version_strategy: The container version strategy. Default to the latest release of OpenLLM.
serialisation_format: Serialisation for saving models. Default to 'safetensors', which is equivalent to `safe_serialization=True`
additional_args: Additional arguments to pass to ``openllm build``.
bento_store: Optional BentoStore for saving this BentoLLM. Default to the default BentoML local store.

Returns:
``bentoml.Bento | str``: BentoLLM instance. This can be used to serve the LLM or can be pushed to BentoCloud.
If 'format="container"', then it returns the default 'container_name:container_tag'
``bentoml.Bento | str``: BentoLLM instance. This can be used to serve the LLM or can be pushed to BentoCloud.
"""
args: ListStr = [sys.executable, "-m", "openllm", "build", model_name, "--machine", "--runtime", runtime, "--serialisation", serialisation_format,]
if quantize and bettertransformer: raise OpenLLMException("'quantize' and 'bettertransformer' are currently mutually exclusive.")

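For orientation, the ``openllm.build`` API documented above is used roughly like this. This is a sketch; the model name and id are illustrative and borrowed from examples elsewhere in this README diff:

```python
# Sketch of openllm.build as described in the docstring above (illustrative arguments).
import openllm

bento = openllm.build(
    "opt",
    model_id="facebook/opt-6.7b",  # same example id used in the README hunk earlier in this commit
    quantize="int8",               # one of int8 / int4 / gptq per the docstring; exclusive with bettertransformer
)
print(bento)  # a bentoml.Bento (or a string when building a container), per the documented return value
```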
@@ -633,31 +628,33 @@ def _import_model(
) -> bentoml.Model:
"""Import a LLM into local store.

> **Note**: If ``quantize`` is passed, the model weights will be saved as quantized weights. You should
> [!NOTE]
> If ``quantize`` is passed, the model weights will be saved as quantized weights. You should
> only use this option if you want the weight to be quantized by default. Note that OpenLLM also
> support on-demand quantisation during initial startup.

``openllm.download`` will invoke ``click.Command`` under the hood, so it behaves exactly the same as the CLI ``openllm import``.

> **Note**: ``openllm.start`` will automatically invoke ``openllm.download`` under the hood.
> [!NOTE]
> ``openllm.start`` will automatically invoke ``openllm.download`` under the hood.

Args:
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
implementation: The implementation to use for this LLM. By default, this is set to ``pt``.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
serialisation_format: Type of model format to save to local store. If set to 'safetensors', then OpenLLM will save model using safetensors.
Default behaviour is similar to ``safe_serialization=False``.
additional_args: Additional arguments to pass to ``openllm import``.
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
implementation: The implementation to use for this LLM. By default, this is set to ``pt``.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
serialisation_format: Type of model format to save to local store. If set to 'safetensors', then OpenLLM will save model using safetensors.
Default behaviour is similar to ``safe_serialization=False``.
additional_args: Additional arguments to pass to ``openllm import``.

Returns:
``bentoml.Model``:BentoModel of the given LLM. This can be used to serve the LLM or can be pushed to BentoCloud.
``bentoml.Model``:BentoModel of the given LLM. This can be used to serve the LLM or can be pushed to BentoCloud.
"""
args = [model_name, "--runtime", runtime, "--implementation", implementation, "--machine", "--serialisation", serialisation_format,]
if model_id is not None: args.append(model_id)

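Finally, the import/download path documented in this last hunk, sketched in Python. The model name and id are illustrative, and the exact keyword arguments accepted by ``openllm.download`` are an assumption based on the docstring above:

```python
# Sketch of openllm.download as referenced in the docstring above (illustrative arguments).
import openllm

model = openllm.download("opt", model_id="facebook/opt-6.7b")  # saves the weights into the local BentoML store
print(model.tag)  # a bentoml.Model is returned, per the documented return value
```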