Mirror of https://github.com/bentoml/OpenLLM.git (synced 2026-01-21 22:10:45 -05:00)

chore(docs): update to obsidian README format

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

README.md: 29 changes
@@ -135,7 +135,8 @@ specify different variants of the model to be served, by providing the
openllm start flan-t5 --model-id google/flan-t5-large
```

> **Note** that `openllm` also supports all variants of fine-tuning weights,
> [!NOTE]
> `openllm` also supports all variants of fine-tuning weights,
> custom model path as well as quantized weights for any of the supported models
> as long as it can be loaded with the model architecture. Refer to
> [supported models](https://github.com/bentoml/OpenLLM/tree/main#-supported-models)

@@ -417,7 +418,8 @@ For example, if you want to use the Tensorflow (`tf`) implementation for the
OPENLLM_FLAN_T5_FRAMEWORK=tf openllm start flan-t5
```

> **Note** For GPU support on Flax, refer to
> [!NOTE]
> For GPU support on Flax, refer to
> [Jax's installation](https://github.com/google/jax#pip-installation-gpu-cuda-installed-via-pip-easier)
> to make sure that you have Jax support for the corresponding CUDA version.

@@ -437,7 +439,8 @@ To run inference with `gptq`, simply pass `--quantize gptq`:
openllm start falcon --model-id TheBloke/falcon-40b-instruct-GPTQ --quantize gptq --device 0
```

> **Note**: to run GPTQ, make sure to install with
> [!NOTE]
> In order to run GPTQ, make sure to install with
> `pip install "openllm[gptq]"`. The weights of all supported models should be
> quantized before serving. See
> [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for more

@@ -482,7 +485,8 @@ To include this into the Bento, one can also provide a `--adapter-id` into
openllm build opt --model-id facebook/opt-6.7b --adapter-id ...
```

> **Note**: We will gradually roll out support for fine-tuning all models. The
> [!NOTE]
> We will gradually roll out support for fine-tuning all models. The
> following models contain fine-tuning support: OPT, Falcon, LLaMA.

### Integrating a New Model

@@ -527,8 +531,8 @@ client = openllm.client.HTTPClient("http://localhost:3000")
client.embed("I like to eat apples")
```

> **Note**: Currently, the following model families support embeddings: Llama,
> T5 (Flan-T5, FastChat, etc.), ChatGLM
> [!NOTE]
> Currently, the following model families support embeddings: Llama, T5 (Flan-T5, FastChat, etc.), ChatGLM

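For orientation, the embeddings call shown in this hunk is used roughly as follows. This is a minimal sketch only: the server URL and prompt are illustrative, and the exact structure of the returned value depends on the OpenLLM client version.

```python
# Minimal sketch of the embeddings API shown in the hunk above (illustrative values only).
import openllm

# Assumes a server started with e.g. `openllm start flan-t5` on the default port.
client = openllm.client.HTTPClient("http://localhost:3000")
result = client.embed("I like to eat apples")  # embedding for the prompt; shape depends on the model
print(result)
```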
## ⚙️ Integrations

@@ -606,7 +610,8 @@ def chat(input_text: str):
return agent.run(input_text)
```

> **Note** You can find out more examples under the
> [!NOTE]
> You can find out more examples under the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.

### Transformers Agents

@@ -614,7 +619,8 @@ def chat(input_text: str):
OpenLLM seamlessly integrates with
[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).

> **Warning** The Transformers Agent is still at an experimental stage. It is
> [!WARNING]
> The Transformers Agent is still at an experimental stage. It is
> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
> to get the latest API update for HuggingFace agent.

@@ -626,7 +632,8 @@ agent = transformers.HfAgent("http://localhost:3000/hf/agent") # URL that runs
agent.run("Is the following `text` positive or negative?", text="I don't like how this models is generate inputs")
```

> **Note** Only `starcoder` is currently supported with Agent integration. The
> [!IMPORTANT]
> Only `starcoder` is currently supported with Agent integration. The
> example above was also run with four T4s on EC2 `g4dn.12xlarge`

If you want to use OpenLLM client to ask questions to the running agent, you can

@@ -646,6 +653,7 @@ client.ask_agent(
<!-- hatch-fancy-pypi-readme interim stop -->


<br/>

<!-- hatch-fancy-pypi-readme meta start -->

@@ -691,7 +699,8 @@ serverless cloud for shipping and scaling AI applications.
bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>
```

> **Note**: Replace `<your-api-token>` and `<bento-cloud-endpoint>` with your
> [!NOTE]
> Replace `<your-api-token>` and `<bento-cloud-endpoint>` with your
> specific API token and the BentoCloud endpoint respectively.

3. **Building a Bento**: With OpenLLM, you can easily build a Bento for a

@@ -11,7 +11,7 @@ text = """

"""
[[metadata.hooks.fancy-pypi-readme.fragments]]
end-before = "\n<!-- hatch-fancy-pypi-readme intro stop -->\n"
end-before = "\n<!-- hatch-fancy-pypi-readme intro stop -->"
path = "README.md"
start-after = "<!-- hatch-fancy-pypi-readme intro start -->\n"
[[metadata.hooks.fancy-pypi-readme.fragments]]

@@ -22,11 +22,12 @@ text = """
</p>
"""
[[metadata.hooks.fancy-pypi-readme.fragments]]
end-before = "\n<!-- hatch-fancy-pypi-readme interim stop -->\n"
end-before = "\n<!-- hatch-fancy-pypi-readme interim stop -->"
path = "README.md"
start-after = "<!-- hatch-fancy-pypi-readme interim start -->\n"
[[metadata.hooks.fancy-pypi-readme.fragments]]
text = """

<p align="center">
<img src="https://raw.githubusercontent.com/bentoml/openllm/main/assets/agent.gif" alt="Gif showing Agent integration" />
</p>

@@ -43,6 +44,7 @@ text = """
"""
[[tool.hatch.metadata.hooks.fancy-pypi-readme.fragments]]
path = "CHANGELOG.md"
start-after = "<!-- towncrier release notes start -->"
pattern = "\n(###.+?\n)## "
[[metadata.hooks.fancy-pypi-readme.fragments]]
text = """

@@ -1423,8 +1423,8 @@ class LLMConfig(_ConfigAttr):

This can be used as a decorator for click commands.

> **Note**: that the identifier for all LLMConfig will be prefixed with '<model_name>_*', and the generation config
will be prefixed with '<model_name>_generation_*'.
> [!NOTE]
> The identifier for all LLMConfig will be prefixed with '<model_name>_*', and the generation config will be prefixed with '<model_name>_generation_*'.
"""
for name, field in attr.fields_dict(cls.__openllm_generation_class__).items():
ty = cls.__openllm_hints__.get(name)

@@ -223,7 +223,8 @@ class LLMInterface(ABC, t.Generic[M, T]):

You can customize how the output of the LLM looks with this hook. By default, it is a simple echo.

NOTE: this will be used from the client side.
> [!NOTE]
> This will be used from the client side.
"""
return generation_result
def llm_post_init(self) -> None:

@@ -271,7 +272,7 @@ class LLMInterface(ABC, t.Generic[M, T]):
- If `self.bettertransformer` is set within `llm_post_init`.
- Finally, if none of the above, default to self.config['bettertransformer']

> **Note** that if LoRA is enabled, bettertransformer will be disabled.
> [!NOTE] that if LoRA is enabled, bettertransformer will be disabled.
"""
device: "torch.device"
"""The device to be used for this LLM. If the implementation is 'pt', then it will be torch.device, else string."""

@@ -562,7 +563,7 @@ class LLM(LLMInterface[M, T], ReprMixin):

Args:
model_id: The pretrained model to use. Defaults to None. If None, 'self.default_id' will be used.
> **Warning**: If custom path is passed, make sure it contains all available files to construct
> [!WARNING] If custom path is passed, make sure it contains all available files to construct
> ``transformers.PretrainedConfig``, ``transformers.PreTrainedModel``, and ``transformers.PreTrainedTokenizer``.
model_name: Optional model name to be saved with this LLM. Default to None. It will be inferred automatically from model_id.
If model_id is a custom path, it will be the basename of the given path.

@@ -629,7 +630,7 @@ class LLM(LLMInterface[M, T], ReprMixin):
If model_id contains the revision itself, then the same format above
If model_id is a path, then it will be <framework>-<basename_of_path>:<generated_sha1> if model_version is not passed, otherwise <framework>-<basename_of_path>:<model_version>

**Note** here that the generated SHA1 for path cases is that it will be based on last modified time.
> [!NOTE] here that the generated SHA1 for path cases is that it will be based on last modified time.

Args:
model_id: Model id for this given LLM. It can be pretrained weights URL, custom path.

@@ -664,14 +665,14 @@ class LLM(LLMInterface[M, T], ReprMixin):
):
"""Initialize the LLM with given pretrained model.

> **Warning**
> [!WARNING]
> To initialize any LLM, you should use `openllm.AutoLLM` or `openllm.LLM.from_pretrained` instead.
> `__init__` initialization is only for internal use.

Note:
- *args to be passed to the model.
- **attrs will first be parsed to the AutoConfig, then the rest will be parsed to the import_model
- for tokenizer kwargs, it should be prefixed with _tokenizer_*
> [!NOTE]
> - *args to be passed to the model.
> - **attrs will first be parsed to the AutoConfig, then the rest will be parsed to the import_model
> - for tokenizer kwargs, it should be prefixed with _tokenizer_*

For custom pretrained path, it is recommended to pass in 'model_version' alongside with the path
to ensure that it won't be loaded multiple times.

@@ -925,10 +926,11 @@ class LLM(LLMInterface[M, T], ReprMixin):
Returns:
A generated LLMRunner for this LLM.

> **Note**: There are some differences between bentoml.models.get().to_runner() and LLM.to_runner(): 'name'.
- 'name': will be generated by OpenLLM, hence users shouldn't worry about this. The generated name will be 'llm-<model-start-name>-runner' (ex: llm-dolly-v2-runner, llm-chatglm-runner)
- 'embedded': Will be disabled by default. There is no reason to run LLM in embedded mode.
- 'method_configs': The method configs for the runner will be managed internally by OpenLLM.
> [!NOTE]: There are some differences between bentoml.models.get().to_runner() and LLM.to_runner():
>
> - 'name': will be generated by OpenLLM, hence users shouldn't worry about this. The generated name will be 'llm-<model-start-name>-runner' (ex: llm-dolly-v2-runner, llm-chatglm-runner)
> - 'embedded': Will be disabled by default. There is no reason to run LLM in embedded mode.
> - 'method_configs': The method configs for the runner will be managed internally by OpenLLM.
"""
models = models if models is not None else []

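As a quick illustration of the naming behaviour documented in this docstring (a sketch only; `openllm.Runner` appears later in this diff, and the model name is illustrative):

```python
# Sketch of the runner naming rule described above: 'llm-<model-start-name>-runner'.
import openllm

runner = openllm.Runner("dolly-v2")  # resolves and loads the model on first use
print(runner.name)                   # expected to be "llm-dolly-v2-runner" per the docstring above
```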
@@ -946,7 +948,8 @@ class LLM(LLMInterface[M, T], ReprMixin):
# NOTE: returning the two langchain API's to the runner
return llm_runner_class(self)(
llm_runnable_class(self, embeddings_sig, generate_sig, generate_iterator_sig), name=self.runner_name, embedded=False, models=models, max_batch_size=max_batch_size, max_latency_ms=max_latency_ms,
method_configs=bentoml_cattr.unstructure({"embeddings": embeddings_sig, "__call__": generate_sig, "generate": generate_sig, "generate_one": generate_sig, "generate_iterator": generate_iterator_sig}), scheduling_strategy=scheduling_strategy,
method_configs=bentoml_cattr.unstructure({"embeddings": embeddings_sig, "__call__": generate_sig, "generate": generate_sig, "generate_one": generate_sig, "generate_iterator": generate_iterator_sig}),
scheduling_strategy=scheduling_strategy,
)

# NOTE: Scikit API

@@ -972,16 +975,11 @@ class LLM(LLMInterface[M, T], ReprMixin):
@overload
def Runner(model_name: str, *, model_id: str | None = None, model_version: str | None = ..., init_local: t.Literal[False, True] = ..., **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
@overload
def Runner(
model_name: str, *, model_id: str = ..., model_version: str | None = ..., models: list[bentoml.Model] | None = ..., max_batch_size: int | None = ..., max_latency_ms: int | None = ..., method_configs: dict[str, ModelSignatureDict | ModelSignature] | None = ..., embedded: t.Literal[True, False] = ..., scheduling_strategy: type[bentoml.Strategy] | None = ..., **attrs: t.Any
) -> LLMRunner[t.Any, t.Any]: ...
def Runner(model_name: str, *, model_id: str = ..., model_version: str | None = ..., models: list[bentoml.Model] | None = ..., max_batch_size: int | None = ..., max_latency_ms: int | None = ..., method_configs: dict[str, ModelSignatureDict | ModelSignature] | None = ..., embedded: t.Literal[True, False] = ..., scheduling_strategy: type[bentoml.Strategy] | None = ..., **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
@overload
def Runner(model_name: str, *, ensure_available: bool | None = None, init_local: bool = ..., implementation: LiteralRuntime | None = None, llm_config: LLMConfig | None = None, **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
@overload
def Runner(
model_name: str, *, model_id: str | None = ..., model_version: str | None = ..., llm_config: LLMConfig | None = ..., runtime: t.Literal["ggml", "transformers"] | None = ..., quantize: t.Literal["int8", "int4", "gptq"] | None = ..., bettertransformer: str | bool | None = ..., adapter_id: str | None = ..., adapter_name: str | None = ...,
adapter_map: dict[str, str | None] | None = ..., quantization_config: transformers.BitsAndBytesConfig | autogptq.BaseQuantizeConfig | None = None, serialisation: t.Literal["safetensors", "legacy"] = ..., **attrs: t.Any
) -> LLMRunner[t.Any, t.Any]: ...
def Runner(model_name: str, *, model_id: str | None = ..., model_version: str | None = ..., llm_config: LLMConfig | None = ..., runtime: t.Literal["ggml", "transformers"] | None = ..., quantize: t.Literal["int8", "int4", "gptq"] | None = ..., bettertransformer: str | bool | None = ..., adapter_id: str | None = ..., adapter_name: str | None = ..., adapter_map: dict[str, str | None] | None = ..., quantization_config: transformers.BitsAndBytesConfig | autogptq.BaseQuantizeConfig | None = None, serialisation: t.Literal["safetensors", "legacy"] = ..., **attrs: t.Any) -> LLMRunner[t.Any, t.Any]: ...
# fmt: on

def Runner(model_name: str, ensure_available: bool | None = None, init_local: bool = False, implementation: LiteralRuntime | None = None, llm_config: LLMConfig | None = None, **attrs: t.Any) -> LLMRunner[t.Any, t.Any]:

@@ -1017,7 +1015,8 @@ def Runner(model_name: str, ensure_available: bool | None = None, init_local: bo
behaviour
"""
if llm_config is not None:
attrs.update({"model_id": llm_config["env"]["model_id_value"], "bettertransformer": llm_config["env"]["bettertransformer_value"], "quantize": llm_config["env"]["quantize_value"], "runtime": llm_config["env"]["runtime_value"], "serialisation": first_not_none(os.getenv("OPENLLM_SERIALIZATION"), attrs.get("serialisation"), default="safetensors"),})
attrs.update({"model_id": llm_config["env"]["model_id_value"], "bettertransformer": llm_config["env"]["bettertransformer_value"], "quantize": llm_config["env"]["quantize_value"], "runtime": llm_config["env"]["runtime_value"],
"serialisation": first_not_none(os.getenv("OPENLLM_SERIALIZATION"), attrs.get("serialisation"), default="safetensors")})

default_implementation = llm_config.default_implementation() if llm_config is not None else "pt"
implementation = first_not_none(implementation, default=EnvVarMixin(model_name, default_implementation)["framework_value"])

@@ -373,12 +373,12 @@ def quantize_option(f: _AnyCallable | None = None, *, build: bool = False, model

- ``gptq``: ``GPTQ`` [quantization](https://arxiv.org/abs/2210.17323)

**Note** that the model can also be served with quantized weights.
> [!NOTE] that the model can also be served with quantized weights.
""" + (
"""
**Note** that this will set the mode for serving within deployment.""" if build else ""
> [!NOTE] that this will set the mode for serving within deployment.""" if build else ""
) + """
**Note** that quantization is currently only available in *PyTorch* models.""", **attrs)(f)
> [!NOTE] that quantization is currently only available in *PyTorch* models.""", **attrs)(f)

def workers_per_resource_option(f: _AnyCallable | None = None, *, build: bool = False, **attrs: t.Any) -> t.Callable[[FC], FC]:
return cli_option(

@@ -387,16 +387,16 @@ def workers_per_resource_option(f: _AnyCallable | None = None, *, build: bool =
See https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy
for more information. By default, this is set to 1.

**Note**: ``--workers-per-resource`` will also accept the following strategies:
> [!NOTE] ``--workers-per-resource`` will also accept the following strategies:

- ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.

- ``conserved``: This will determine the number of available GPU resources, and only assign one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is equivalent to ``--workers-per-resource 0.25``.
""" + (
"""\n
**Note**: The workers value passed into 'build' will determine how the LLM can
be provisioned in Kubernetes as well as in standalone container. This will
ensure it has the same effect with 'openllm start --workers ...'""" if build else ""
> [!NOTE] The workers value passed into 'build' will determine how the LLM can
> be provisioned in Kubernetes as well as in standalone container. This will
> ensure it has the same effect with 'openllm start --workers ...'""" if build else ""
), **attrs)(f)

def bettertransformer_option(f: _AnyCallable | None = None, *, build: bool = False, model_env: EnvVarMixin | None = None, **attrs: t.Any) -> t.Callable[[FC], FC]:

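The ``conserved`` strategy described in this help text maps to a fractional ``--workers-per-resource`` value. A small illustrative sketch of the arithmetic from the docstring (not OpenLLM internals):

```python
# 'conserved' assigns a single worker across all visible GPUs, so the per-GPU
# fraction is 1 / <number of GPUs>: with 4 GPUs this matches --workers-per-resource 0.25.
def conserved_workers_per_resource(num_gpus: int) -> float:
    return 1 / num_gpus

assert conserved_workers_per_resource(4) == 0.25
```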
@@ -416,13 +416,13 @@ def serialisation_option(f: _AnyCallable | None = None, **attrs: t.Any) -> t.Cal
``safe_serialization=True``.

\b
**Note** that this format might not work for every case, and
> [!NOTE] that this format might not work for every case, and
you can always fallback to ``legacy`` if needed.

- ``legacy``: This will use PyTorch serialisation format, often as ``.bin`` files.
This should be used if the model doesn't yet support safetensors.

**Note** that GGML format is a work in progress.
> [!NOTE] that GGML format is a work in progress.
""", **attrs)(f)

def container_registry_option(f: _AnyCallable | None = None, **attrs: t.Any) -> t.Callable[[FC], FC]:

@@ -432,7 +432,7 @@ def container_registry_option(f: _AnyCallable | None = None, **attrs: t.Any) ->
Currently, it supports 'ecr', 'ghcr.io', 'docker.io'

\b
**Note** that in order to build the base image, you will need a GPU to compile custom kernels. See ``openllm ext build-base-container`` for more information.
> [!NOTE] that in order to build the base image, you will need a GPU to compile custom kernels. See ``openllm ext build-base-container`` for more information.
""")(f)

_wpr_strategies = {"round_robin", "conserved"}

@@ -332,17 +332,17 @@ class OpenLLMCommandGroup(BentoMLCommandGroup):
def group(self, name: _AnyCallable) -> click.Group:
...

# variant: name omitted, cls _must_ be a keyword argument, @group(cmd=GroupCls, ...)
@overload
def group(self, name: None = None, *, cls: t.Type[GrpType], **attrs: t.Any) -> t.Callable[[_AnyCallable], GrpType]:
...

# variant: with positional name and with positional or keyword cls argument:
# @group(namearg, GroupCls, ...) or @group(namearg, cls=GroupCls, ...)
@overload
def group(self, name: str | None, cls: type[GrpType], **attrs: t.Any) -> t.Callable[[_AnyCallable], GrpType]:
...

# variant: name omitted, cls _must_ be a keyword argument, @group(cmd=GroupCls, ...)
@overload
def group(self, name: None = None, *, cls: t.Type[GrpType], **attrs: t.Any) -> t.Callable[[_AnyCallable], GrpType]:
...

# variant: with optional string name, no cls argument provided.
@overload
def group(self, name: str | None = ..., cls: None = None, **attrs: t.Any) -> t.Callable[[_AnyCallable], click.Group]:

@@ -451,7 +451,7 @@ def import_command(model_name: str, model_id: str | None, converter: str | None,
$ CONVERTER=llama2-hf openllm import llama /path/to/llama-2
```

> **Note**: This behaviour will override ``--runtime``. Therefore make sure that the LLM contains correct conversion strategies to both GGML and HF.
> [!WARNING] This behaviour will override ``--runtime``. Therefore make sure that the LLM contains correct conversion strategies to both GGML and HF.
"""
llm_config = AutoConfig.for_model(model_name)
env = EnvVarMixin(model_name, llm_config.default_implementation(), model_id=model_id, runtime=runtime, quantize=quantize)

@@ -484,41 +484,39 @@ def _start(
For all additional arguments, pass it as string to ``additional_args``. For example, if you want to
pass ``--port 5001``, you can pass ``additional_args=["--port", "5001"]``

> **Note**: This will create a blocking process, so if you use this API, you can create a running sub thread
> [!NOTE] This will create a blocking process, so if you use this API, you can create a running sub thread
> to start the server instead of blocking the main thread.

``openllm.start`` will invoke ``click.Command`` under the hood, so it behaves exactly the same as the CLI interaction.

> **Note**: ``quantize`` and ``bettertransformer`` are mutually exclusive.
> [!NOTE] ``quantize`` and ``bettertransformer`` are mutually exclusive.

Args:
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
timeout: The server timeout
workers_per_resource: Number of workers per resource assigned.
See https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy
for more information. By default, this is set to 1.
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
timeout: The server timeout
workers_per_resource: Number of workers per resource assigned.
See [resource scheduling](https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy)
for more information. By default, this is set to 1.

> **Note**: ``--workers-per-resource`` will also accept the following strategies:

> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.

> - ``conserved``: This will determine the number of available GPU resources, and only assign
one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
equivalent to ``--workers-per-resource 0.25``.
device: Assign GPU devices (if available) to this LLM. By default, this is set to ``None``. It also accepts 'all'
argument to assign all available GPUs to this LLM.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
fast: Enable fast mode. This will skip downloading models, and will raise errors if given model_id does not exist under local store.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
framework: The framework to use for this LLM. By default, this is set to ``pt``.
additional_args: Additional arguments to pass to ``openllm start``.
> [!NOTE] ``--workers-per-resource`` will also accept the following strategies:
> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.
> - ``conserved``: This will determine the number of available GPU resources, and only assign
> one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
> equivalent to ``--workers-per-resource 0.25``.
device: Assign GPU devices (if available) to this LLM. By default, this is set to ``None``. It also accepts 'all'
argument to assign all available GPUs to this LLM.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
fast: Enable fast mode. This will skip downloading models, and will raise errors if given model_id does not exist under local store.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
framework: The framework to use for this LLM. By default, this is set to ``pt``.
additional_args: Additional arguments to pass to ``openllm start``.
"""
fast = os.getenv("OPENLLM_FAST", str(fast)).upper() in ENV_VARS_TRUE_VALUES
llm_config = AutoConfig.for_model(model_name)

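A minimal sketch of the Python API this docstring covers, following its own advice to run the blocking call in a sub-thread. The model name and port are illustrative, not prescribed by the commit:

```python
# Sketch of openllm.start as described above: it blocks, so run it in a thread.
import threading

import openllm

server = threading.Thread(
    target=openllm.start,
    args=("flan-t5",),
    kwargs={"additional_args": ["--port", "5001"]},  # extra CLI flags passed through, per the docstring
    daemon=True,
)
server.start()
```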
@@ -554,48 +552,45 @@ def _build(

``openllm.build`` will invoke ``click.Command`` under the hood, so it behaves exactly the same as ``openllm build`` CLI.

> **Note**: ``quantize`` and ``bettertransformer`` are mutually exclusive.
> [!NOTE] ``quantize`` and ``bettertransformer`` are mutually exclusive.

Args:
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
build_ctx: The build context to use for building BentoLLM. By default, it sets to current directory.
enable_features: Additional OpenLLM features to be included with this BentoLLM.
workers_per_resource: Number of workers per resource assigned.
See https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy
for more information. By default, this is set to 1.
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
bettertransformer: Convert given model to FastTransformer with PyTorch.
adapter_map: The adapter mapping of LoRA to use for this LLM. It accepts a dictionary of ``{adapter_id: adapter_name}``.
build_ctx: The build context to use for building BentoLLM. By default, it sets to current directory.
enable_features: Additional OpenLLM features to be included with this BentoLLM.
workers_per_resource: Number of workers per resource assigned.
See [resource scheduling](https://docs.bentoml.org/en/latest/guides/scheduling.html#resource-scheduling-strategy)
for more information. By default, this is set to 1.

> **Note**: ``--workers-per-resource`` will also accept the following strategies:

> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.

> - ``conserved``: This will determine the number of available GPU resources, and only assign
one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
equivalent to ``--workers-per-resource 0.25``.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
dockerfile_template: The dockerfile template to use for building BentoLLM. See
https://docs.bentoml.com/en/latest/guides/containerization.html#dockerfile-template.
overwrite: Whether to overwrite the existing BentoLLM. By default, this is set to ``False``.
push: Whether to push the result bento to BentoCloud. Make sure to login with 'bentoml cloud login' first.
containerize: Whether to containerize the Bento after building. '--containerize' is the shortcut of 'openllm build && bentoml containerize'.
Note that 'containerize' and 'push' are mutually exclusive
container_registry: Container registry to choose the base OpenLLM container image to build from. Default to ECR.
container_version_strategy: The container version strategy. Default to the latest release of OpenLLM.
serialisation_format: Serialisation for saving models. Default to 'safetensors', which is equivalent to `safe_serialization=True`
additional_args: Additional arguments to pass to ``openllm build``.
bento_store: Optional BentoStore for saving this BentoLLM. Default to the default BentoML local store.
> [!NOTE] ``--workers-per-resource`` will also accept the following strategies:
> - ``round_robin``: Similar behaviour when setting ``--workers-per-resource 1``. This is useful for smaller models.
> - ``conserved``: This will determine the number of available GPU resources, and only assign
> one worker for the LLMRunner. For example, if there are 4 GPUs available, then ``conserved`` is
> equivalent to ``--workers-per-resource 0.25``.
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
dockerfile_template: The dockerfile template to use for building BentoLLM. See https://docs.bentoml.com/en/latest/guides/containerization.html#dockerfile-template.
overwrite: Whether to overwrite the existing BentoLLM. By default, this is set to ``False``.
push: Whether to push the result bento to BentoCloud. Make sure to login with 'bentoml cloud login' first.
containerize: Whether to containerize the Bento after building. '--containerize' is the shortcut of 'openllm build && bentoml containerize'.
Note that 'containerize' and 'push' are mutually exclusive
container_registry: Container registry to choose the base OpenLLM container image to build from. Default to ECR.
container_registry: Container registry to choose the base OpenLLM container image to build from. Default to ECR.
container_version_strategy: The container version strategy. Default to the latest release of OpenLLM.
serialisation_format: Serialisation for saving models. Default to 'safetensors', which is equivalent to `safe_serialization=True`
additional_args: Additional arguments to pass to ``openllm build``.
bento_store: Optional BentoStore for saving this BentoLLM. Default to the default BentoML local store.

Returns:
``bentoml.Bento | str``: BentoLLM instance. This can be used to serve the LLM or can be pushed to BentoCloud.
If 'format="container"', then it returns the default 'container_name:container_tag'
``bentoml.Bento | str``: BentoLLM instance. This can be used to serve the LLM or can be pushed to BentoCloud.
"""
args: ListStr = [sys.executable, "-m", "openllm", "build", model_name, "--machine", "--runtime", runtime, "--serialisation", serialisation_format,]
if quantize and bettertransformer: raise OpenLLMException("'quantize' and 'bettertransformer' are currently mutually exclusive.")

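For orientation, the ``openllm.build`` API documented above is used roughly like this. This is a sketch; the model name and id are illustrative and borrowed from examples elsewhere in this README diff:

```python
# Sketch of openllm.build as described in the docstring above (illustrative arguments).
import openllm

bento = openllm.build(
    "opt",
    model_id="facebook/opt-6.7b",  # same example id used in the README hunk earlier in this commit
    quantize="int8",               # one of int8 / int4 / gptq per the docstring; exclusive with bettertransformer
)
print(bento)  # a bentoml.Bento (or a string when building a container), per the documented return value
```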
@@ -633,31 +628,33 @@ def _import_model(
) -> bentoml.Model:
"""Import a LLM into local store.

> **Note**: If ``quantize`` is passed, the model weights will be saved as quantized weights. You should
> [!NOTE]
> If ``quantize`` is passed, the model weights will be saved as quantized weights. You should
> only use this option if you want the weight to be quantized by default. Note that OpenLLM also
> support on-demand quantisation during initial startup.

``openllm.download`` will invoke ``click.Command`` under the hood, so it behaves exactly the same as the CLI ``openllm import``.

> **Note**: ``openllm.start`` will automatically invoke ``openllm.download`` under the hood.
> [!NOTE]
> ``openllm.start`` will automatically invoke ``openllm.download`` under the hood.

Args:
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
implementation: The implementation to use for this LLM. By default, this is set to ``pt``.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
serialisation_format: Type of model format to save to local store. If set to 'safetensors', then OpenLLM will save model using safetensors.
Default behaviour is similar to ``safe_serialization=False``.
additional_args: Additional arguments to pass to ``openllm import``.
model_name: The model name to start this LLM
model_id: Optional model id for this given LLM
model_version: Optional model version for this given LLM
runtime: The runtime to use for this LLM. By default, this is set to ``transformers``. In the future, this will include supports for GGML.
implementation: The implementation to use for this LLM. By default, this is set to ``pt``.
quantize: Quantize the model weights. This is only applicable for PyTorch models.
Possible quantisation strategies:
- int8: Quantize the model with 8bit (bitsandbytes required)
- int4: Quantize the model with 4bit (bitsandbytes required)
- gptq: Quantize the model with GPTQ (auto-gptq required)
serialisation_format: Type of model format to save to local store. If set to 'safetensors', then OpenLLM will save model using safetensors.
Default behaviour is similar to ``safe_serialization=False``.
additional_args: Additional arguments to pass to ``openllm import``.

Returns:
``bentoml.Model``:BentoModel of the given LLM. This can be used to serve the LLM or can be pushed to BentoCloud.
``bentoml.Model``:BentoModel of the given LLM. This can be used to serve the LLM or can be pushed to BentoCloud.
"""
args = [model_name, "--runtime", runtime, "--implementation", implementation, "--machine", "--serialisation", serialisation_format,]
if model_id is not None: args.append(model_id)

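Finally, the import/download path documented in this last hunk, sketched in Python. The model name and id are illustrative, and the exact keyword arguments accepted by ``openllm.download`` are an assumption based on the docstring above:

```python
# Sketch of openllm.download as referenced in the docstring above (illustrative arguments).
import openllm

model = openllm.download("opt", model_id="facebook/opt-6.7b")  # saves the weights into the local BentoML store
print(model.tag)  # a bentoml.Model is returned, per the documented return value
```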