# Optimize with Policy Search

> Boost inference throughput by up to 2x for MoE and quantized models by running execution policy search in Friendli Container for production.

## Introduction

For specialized cases, such as **serving MoE models (e.g., Mixtral)** or **quantized models**, inference performance can be further optimized through an execution policy search.
Policy search is optional, but Friendli Engine reaches its optimized speed only when it runs with a searched policy.
With the optimal policy, throughput and latency can improve by 1.5x to 2x.
We therefore recommend skipping policy search for simple model testing and running it before cost or latency analysis for a production service.

<Note>
  Policy search is effective only when serving (1) MoE models or (2) AWQ, FP8, or INT8 quantized models. For other models, it brings no benefit.
</Note>

## Running Policy Search

You can run policy search by adding the following options to the launch command of Friendli Container.

| Options                    | Type    | Summary                                                                                                                                                             | Default             |
| -------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- |
| `--algo-policy-dir`        | TEXT    | Path to the directory to save the searched optimal policy file. The default value is the current working directory.                                                 | current working dir |
| `--search-policy`          | BOOLEAN | Runs policy search to find the best Friendli execution policy for the given configuration such as model type, GPU, NVIDIA driver version, quantization scheme, etc. | false               |
| `--terminate-after-search` | BOOLEAN | Terminates engine container after policy search.                                                                                                                    | false               |

### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8`

For example, you can start policy search for the [FriendliAI/Llama-3.1-8B-Instruct-fp8](https://huggingface.co/FriendliAI/Llama-3.1-8B-Instruct-fp8) model as follows:

```sh theme={null}
export HF_MODEL_NAME="FriendliAI/Llama-3.1-8B-Instruct-fp8"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0"'
export POLICY_DIR="$PWD/policy"

mkdir -p "$POLICY_DIR"

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name $HF_MODEL_NAME \
    --algo-policy-dir /policy \
    --search-policy true
```

### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4)

```sh theme={null}
export HF_MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0,1,2,3"'
export POLICY_DIR="$PWD/policy"

mkdir -p "$POLICY_DIR"

docker run -p 8000:8000 \
  --ipc=host --gpus $GPU_ENUMERATION \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name $HF_MODEL_NAME \
    --num-devices 4 \
    --algo-policy-dir /policy \
    --search-policy true
```

Once the policy search is complete, a policy file is created in `$POLICY_DIR`.
If a policy file already exists, the engine searches only the parts of the search space that still need it and updates the policy file accordingly.
After the policy search, the engine starts serving the endpoint using the policy file.
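
As a quick sanity check after a search, you can confirm that something was written to the mounted host directory. This is a minimal sketch: `has_policy_file` is a hypothetical helper, not part of Friendli Container, and it makes no assumption about the policy file's name. Any entry in the directory counts.

```sh
# Hypothetical helper: succeeds when the directory exists and is not empty.
# The policy file name is not assumed; any entry in the directory counts.
has_policy_file() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Assumes POLICY_DIR is the host directory mounted at /policy in the examples above.
POLICY_DIR="${POLICY_DIR:-$PWD/policy}"
if has_policy_file "$POLICY_DIR"; then
  ls -l "$POLICY_DIR"
else
  echo "No policy file in $POLICY_DIR yet" >&2
fi
```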

<Note>
  Finding the optimal policy for the Llama 2 13B model on an NVIDIA A100 80GB GPU takes up to several minutes.
  The estimated total time and remaining time are printed to stderr while the policy search runs.
</Note>
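
Because the progress estimates go to stderr, it can be handy to keep a copy of that stream while still watching it live. The sketch below is one way to do this in a POSIX shell. `run_with_stderr_log` is a hypothetical wrapper, not a Friendli or Docker feature. It leaves stdout untouched and duplicates stderr into a log file.

```sh
# Hypothetical wrapper: run a command, copy its stderr to a log file,
# and still show stderr on the terminal. Stdout passes through unchanged.
# Usage: run_with_stderr_log LOG_FILE COMMAND [ARGS...]
run_with_stderr_log() {
  log_file=$1
  shift
  # fd 3 keeps the original stdout; stderr is routed through tee.
  { "$@" 2>&1 1>&3 | tee "$log_file" >&2; } 3>&1
}
```

To use it, prefix the launch command with the wrapper, for example `run_with_stderr_log policy_search.log docker run ...` followed by the same arguments as in the examples above. Note that the wrapper returns the exit status of `tee`, not of the wrapped command.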

## Running Policy Search Without Starting Serving Endpoint

To search for the best policy without starting the serving endpoint, add the `--terminate-after-search true` option to the Friendli Container launch command. The examples below reuse the environment variables exported in the previous section.

### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8`

```sh theme={null}
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
    --algo-policy-dir /policy \
    --search-policy true \
    --terminate-after-search true
```

### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4)

```sh theme={null}
docker run -p 8000:8000 \
  --ipc=host --gpus $GPU_ENUMERATION \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --num-devices 4 \
    --algo-policy-dir /policy \
    --search-policy true \
    --terminate-after-search true
```

## FAQ: When to Run Policy Search Again

The execution policy depends on the following factors:

* Model
* GPU
* GPU count and parallelism degree (the value of the `--num-devices` option)
* NVIDIA driver major version
* Friendli Container version

Run policy search again whenever any of these changes in your serving setup.
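
One way to notice such a change is to store a small fingerprint of these factors next to the policy file and compare it before each launch. The following is a minimal sketch under stated assumptions: `policy_fingerprint` and `needs_policy_search` are hypothetical helpers, the fingerprint format is arbitrary, and a container image ID is used as a stand-in for the Friendli Container version.

```sh
# Hypothetical helper: build a one-line fingerprint from the factors listed above.
# Arguments: model, GPU name, device count, NVIDIA driver version, container image ID.
policy_fingerprint() {
  driver_major=${4%%.*}  # keep only the major version, e.g. 535.104.05 -> 535
  printf '%s|%s|%s|%s|%s\n' "$1" "$2" "$3" "$driver_major" "$5"
}

# Hypothetical helper: succeeds (exit status 0) when the saved fingerprint
# is missing or differs from the current one.
# Arguments: fingerprint file, current fingerprint.
needs_policy_search() {
  [ ! -f "$1" ] || [ "$(cat "$1")" != "$2" ]
}
```

On the host, the GPU name and driver version can be read with `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`, and `docker image inspect --format '{{.Id}}' "$FRIENDLI_CONTAINER_IMAGE"` returns an image ID. If `needs_policy_search` succeeds, launch with `--search-policy true` and save the new fingerprint afterwards.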
