# Configuration

> Configuration reference for Friendli Container: how to pass launch options, serve across multiple GPUs, enable quantization, and run MoE models.

This page is the configuration reference for Friendli Container—how to pass launch options, serve across multiple GPUs, and tune serving for your model. If you haven't run a container yet, start with the [Quickstart](/guides/container/quickstart).

Friendli Container supports direct loading of [`safetensors`](https://huggingface.co/docs/safetensors/index) checkpoints—compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers)—for many model types. You can find the complete list of supported models on the [Supported Models page](https://friendli.ai/models?products=CONTAINER). If your model is not on the list, please [contact support](mailto:support@friendli.ai).

## Passing Launch Options

Launch options are passed as arguments after the image name in your `docker run` command:

```sh theme={null}
# Fill in the values of the following variables.
export HF_MODEL_NAME=""  # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret

docker run --gpus '"device=0"' -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial \
    --hf-model-name $HF_MODEL_NAME \
    [LAUNCH_OPTIONS]
```

Replace `[LAUNCH_OPTIONS]` with the options described in [Launch Options](#launch-options). Running the command above starts a Docker container that exposes an HTTP endpoint for handling inference requests.
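
Once the container is up, a quick way to check that the endpoint responds is to send it a request with `curl`. The sketch below is only an illustration: it assumes an OpenAI-style `/v1/completions` route and `prompt`/`max_tokens` request fields, none of which are stated on this page, and `completion_url` and `send_prompt` are hypothetical helper names. Check the API reference for the exact request schema.

```sh
# completion_url: build the request URL for a host and port
# (defaults match the -p 8000:8000 mapping used above).
# Assumption: the container serves an OpenAI-style /v1/completions route.
completion_url() {
  echo "http://${1:-localhost}:${2:-8000}/v1/completions"
}

# send_prompt: POST a single prompt to the running container.
# "prompt" and "max_tokens" are assumed field names.
# The prompt is pasted into the JSON body as-is, so it must not contain double quotes.
send_prompt() {
  curl -sS -X POST "$(completion_url localhost 8000)" \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"$1\", \"max_tokens\": 30}"
}
```

With the container running, `send_prompt "Python is a popular"` sends one request. If you change `--web-server-port`, adjust both the `-p` mapping and the port passed to `completion_url`.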

## Multi-GPU Serving

Friendli Container supports ***tensor parallelism*** and ***pipeline parallelism*** for multi-GPU inference.

### Tensor Parallelism

Use tensor parallelism when serving large models that exceed the memory capacity of a single GPU. It distributes parts of the model's weights across multiple GPUs. To use tensor parallelism with Friendli Container:

1. Specify multiple GPUs in the `--gpus` option of `docker run` (e.g., `--gpus '"device=0,1,2,3"'`).
2. Use the `--num-devices` (or `-d`) option to set the tensor parallelism degree (e.g., `--num-devices 4`).
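
The two steps above are easy to get out of sync, since the GPU list and the parallelism degree are set in different places. A minimal sketch that derives both from a single GPU list follows; `count_gpus` and `launch_tp` are illustrative names, not part of Friendli Container, and the `docker run` flags are taken from the examples on this page.

```sh
# count_gpus: count the entries in a comma-separated GPU list.
count_gpus() {
  echo "$1" | tr ',' '\n' | grep -c .
}

# launch_tp: start the container with tensor parallelism across the listed GPUs.
# Usage: launch_tp "0,1,2,3" meta-llama/Llama-3.1-70B-Instruct
launch_tp() {
  gpus="$1"
  model="$2"
  docker run -p 8000:8000 \
    --ipc=host --gpus "\"device=${gpus}\"" \
    -e HF_TOKEN="$HF_TOKEN" \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    registry.friendli.ai/trial \
      --hf-model-name "$model" \
      --num-devices "$(count_gpus "$gpus")"
}
```

Defining the functions has no side effects; the container starts only when `launch_tp` is called.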

### Examples

<Tabs>
  <Tab title="Deploying Models on a Single GPU">
    This is an example running [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) with a single GPU.

    ```sh theme={null}
    export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret (skip this line if it's already set in your environment)
    export HF_TOKEN=""  # Access token from Hugging Face (see the warning below)

    docker run -p 8000:8000 --gpus '"device=0"' \
      -e HF_TOKEN=$HF_TOKEN \
      -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      registry.friendli.ai/trial \
        --hf-model-name meta-llama/Llama-3.1-8B-Instruct
    ```

    <Warning>
      Downloading `meta-llama/Llama-3.1-8B-Instruct` is restricted to authorized users, so you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through the `HF_TOKEN` environment variable. The same applies to all private repositories.
    </Warning>
  </Tab>

  <Tab title="Deploying Models on Multi-GPU">
    This is an example running [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) with a multi-GPU setup.

    ```sh {5,11} theme={null}
    export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret (skip this line if it's already set in your environment)
    export HF_TOKEN=""  # Access token from Hugging Face (see the warning below)

    docker run -p 8000:8000 \
      --ipc=host --gpus '"device=0,1"' \
      -e HF_TOKEN=$HF_TOKEN \
      -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      registry.friendli.ai/trial \
        --hf-model-name meta-llama/Llama-3.1-70B-Instruct \
        --num-devices 2
    ```

    <Warning>
      Downloading `meta-llama/Llama-3.1-70B-Instruct` is restricted to authorized users, so you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through the `HF_TOKEN` environment variable. The same applies to all private repositories.
    </Warning>
  </Tab>
</Tabs>

## Quantization

Friendli Container supports **online quantization**, which quantizes a model on the fly when you launch it, and can also serve pre-quantized models. If your model is already quantized or needs to be quantized, see [Quantization](/guides/container/quantization) for details.

## Serving MoE Models

Running MoE (Mixture of Experts) models requires an additional step to search for the execution policy. See [Serving MoE Models](/guides/container/serving-moe-models) to learn how to launch Friendli Container for MoE models.
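
As a rough illustration of that extra step, the sketch below combines the `--search-policy`, `--terminate-after-search`, and `--algo-policy-dir` options from the Launch Options table into a search-only run. It rests on several assumptions: that BOOLEAN options take an explicit `true` value, that `/policy` is an acceptable in-container mount point, and that a single GPU is enough for your model. `policy_search_opts` and `search_policy` are hypothetical helper names. The linked guide remains the authoritative description of the workflow.

```sh
# policy_search_opts: launch options for a search-only run.
# Assumption: BOOLEAN options are passed with an explicit true/false value.
policy_search_opts() {
  echo "--algo-policy-dir $1 --search-policy true --terminate-after-search true"
}

# search_policy: run the policy search once and keep the result on the host.
# /policy is an arbitrary mount point chosen for this sketch.
# Usage: search_policy <hf-model-name> <host-policy-dir>
search_policy() {
  model="$1"
  policy_dir="$2"
  # The helper's output is left unquoted on purpose so it splits into separate arguments.
  docker run --gpus '"device=0"' -p 8000:8000 \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v "${policy_dir}:/policy" \
    registry.friendli.ai/trial \
      --hf-model-name "$model" \
      $(policy_search_opts /policy)
}
```

After the search finishes, a later launch with the same volume mount and only `--algo-policy-dir /policy` would presumably reuse the saved policy file instead of searching again.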

## Options for Running Friendli Container

### General Options

| Options     | Type | Summary                                | Default | Required |
| ----------- | ---- | -------------------------------------- | ------- | -------- |
| `--version` | -    | Print Friendli Container version.      | -       | ❌        |
| `--help`    | -    | Print Friendli Container help message. | -       | ❌        |

### Launch Options

| Options                           | Type                        | Summary                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | Default             | Required |
| --------------------------------- | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | -------- |
| `--web-server-port`               | INT                         | Web server port.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 8000                | ❌        |
| `--metrics-port`                  | INT                         | Prometheus metrics export port.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | 8281                | ❌        |
| `--hf-model-name`                 | TEXT                        | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks if the model is already cached at \~/.cache/huggingface/hub and uses it if available. If not, it will download the model from the Hugging Face Models Hub before launching the container. When a local path is provided, it will load the model from the location without downloading. This option is only available for models in a safetensors format. | -                   | ❌        |
| `--tokenizer-file-path`           | TEXT                        | Absolute path of tokenizer file. This option is not needed when `tokenizer.json` is located under the path specified at `--ckpt-path`.                                                                                                                                                                                                                                                                                                                                                                                      | -                   | ❌        |
| `--tokenizer-add-special-tokens`  | BOOLEAN                     | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's `add_special_tokens` argument. The default value is **false** for versions \< v1.6.0.                                                                                                                                                                                                                                                                                                                                          | `true`              | ❌        |
| `--tokenizer-skip-special-tokens` | BOOLEAN                     | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's `skip_special_tokens` argument.                                                                                                                                                                                                                                                                                                                                                                                           | `true`              | ❌        |
| `--dtype`                         | CHOICE: \[bf16, fp16, fp32] | Data type of weights and activations. Choose one of \<fp16\|bf16\|fp32>. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of `torch_dtype` in `config.json` file or assumes fp16.                                                                                                                                                                                                                                                                     | fp16                | ❌        |
| `--bad-stop-file-path`            | TEXT                        | JSON file path that contains stop sequences or bad words/tokens.                                                                                                                                                                                                                                                                                                                                                                                                                                                            | -                   | ❌        |
| `--num-request-threads`           | INT                         | Thread pool size for handling HTTP requests.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 4                   | ❌        |
| `--timeout-microseconds`          | INT                         | Server-side timeout for client requests, in microseconds.                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 0 (no timeout)      | ❌        |
| `--ignore-nan-error`              | BOOLEAN                     | If set to True, ignore NaN error. Otherwise, respond with a 400 status code if NaN values are detected while processing a request.                                                                                                                                                                                                                                                                                                                                                                                          | -                   | ❌        |
| `--max-batch-size`                | INT                         | Max number of sequences that can be processed in a batch.                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 384                 | ❌        |
| `--num-devices`, `-d`             | INT                         | Number of devices to use in tensor parallelism degree.                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 1                   | ❌        |
| `--search-policy`                 | BOOLEAN                     | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing-inference-with-policy-search).                                                                                                                                                                                                                                                                           | false               | ❌        |
| `--terminate-after-search`        | BOOLEAN                     | Terminates engine container after the policy search.                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | false               | ❌        |
| `--algo-policy-dir`               | TEXT                        | Path to directory containing the policy file. The default value is the current working directory. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing-inference-with-policy-search).                                                                                                                                                                                                                                                                                   | current working dir | ❌        |
| `--adapter-model`                 | TEXT                        | Add an adapter model with adapter name and path; \<adapter\_name>:\<adapter\_ckpt\_path>. The path can be a name from a Hugging Face model hub.                                                                                                                                                                                                                                                                                                                                                                             | -                   | ❌        |
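
Two rows in this table are easy to misconfigure: `--timeout-microseconds` takes microseconds rather than seconds, and the `--metrics-port` endpoint is reachable from the host only if Docker publishes that port. The sketch below handles both; `seconds_to_us` and `launch_with_timeout` are illustrative helper names, and `HF_MODEL_NAME` is the variable from [Passing Launch Options](#passing-launch-options).

```sh
# seconds_to_us: convert whole seconds to the microsecond value
# expected by --timeout-microseconds (1 s = 1,000,000 microseconds).
seconds_to_us() {
  echo $(( $1 * 1000000 ))
}

# launch_with_timeout: apply a server-side request timeout and publish
# the default metrics port (8281) next to the web server port (8000).
# Usage: launch_with_timeout 30
launch_with_timeout() {
  docker run --gpus '"device=0"' -p 8000:8000 -p 8281:8281 \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    registry.friendli.ai/trial \
      --hf-model-name "$HF_MODEL_NAME" \
      --timeout-microseconds "$(seconds_to_us "$1")"
}
```

With the metrics port published, a Prometheus scraper on the host can target port 8281. The conventional scrape path is `/metrics`, but that path is an assumption here and is not confirmed by this page.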

### Model Specific Options

#### T5

| Options               | Type | Summary                | Default | Required |
| --------------------- | ---- | ---------------------- | ------- | -------- |
| `--max-input-length`  | INT  | Maximum input length.  | -       | ✅        |
| `--max-output-length` | INT  | Maximum output length. | -       | ✅        |