# Quantization

> Learn how to serve pre-quantized models or perform online quantization with Friendli Container to reduce memory usage and speed up inference.

## What Is Quantization

**Quantization** is a technique that reduces the numerical precision of a generative AI model's parameters, which lowers memory usage and speeds up inference while aiming to preserve response quality.
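
To make the memory effect concrete, the sketch below estimates weight storage as parameter count times bits per parameter, divided by 8 bits per byte. It is a back-of-the-envelope figure only: `weight_gb` is a hypothetical helper, 671 (billion parameters) is an illustrative model size, and the estimate ignores the KV cache, activations, and any scaling metadata that quantized formats store alongside the weights.

```sh
# Rough weight-memory estimate in GB (1 GB = 10^9 bytes).
# $1: parameter count in billions, $2: bits per parameter.
# Hypothetical helper for illustration; not part of Friendli Container.
weight_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# Compare the estimate at the three precisions the --quantization option accepts.
for bits in 16 8 4; do
  echo "${bits}-bit: $(weight_gb 671 "$bits") GB"
done
```

Each halving of the bit width halves the estimated weight footprint, which is what determines how many GPUs are needed to hold the weights.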

### Friendli Container Supports

* **Online quantization**: Quantize your model *<u>on the fly at serving time</u>*. You don’t need to prepare pre-quantized weights in advance. Launch the container with the `--quantization` option, and the model is quantized as the container starts.
* **Serving a pre-quantized model**: Serve a model that has *<u>already been quantized</u>*. In this mode, Friendli Container loads the existing quantized weights directly at serving time.

## Serving a Model with Online Quantization

If you want to serve your own model but need to quantize it (or adjust its precision), Friendli Container offers **online quantization**, eliminating the need to prepare a quantized model in advance.

Once your model is ready, you can serve it with online quantization by adding the `--quantization` argument when [running Friendli Container](/guides/container/configuration).

* `--quantization` `(8bit|4bit|16bit)`: Applies online quantization at the specified precision. Friendli Container detects your hardware automatically and selects a suitable quantization scheme.

<Note>
  - Use `--quantization 8bit` for **NVIDIA Ada, Hopper, and Blackwell** GPUs.
  - Use `--quantization 4bit` for **NVIDIA Hopper and Blackwell** GPUs.
</Note>

<Tip>
  To dequantize a model to 16-bit precision, use `--quantization 16bit`.
</Tip>
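
The GPU guidance above can be turned into a small pre-flight check. The sketch below is not part of Friendli Container: `supported_quantization` is a hypothetical helper, and the mapping from CUDA compute capability to architecture (8.9 for Ada, 9.x for Hopper, 10.x and later treated as Blackwell) is an assumption based on NVIDIA's published compute capabilities, not on this guide.

```sh
# Map a CUDA compute capability (e.g. "9.0") to the --quantization values
# that the note above lists for that GPU generation.
# Assumption: 8.9 = Ada, 9.x = Hopper, 10.x and later = Blackwell.
supported_quantization() {
  cap="$1"
  major="${cap%%.*}"   # text before the first dot
  minor="${cap#*.}"    # text after the first dot
  if [ "$major" -ge 9 ]; then
    echo "8bit 4bit"   # Hopper and Blackwell: both options are listed
  elif [ "$major" -eq 8 ] && [ "$minor" -eq 9 ]; then
    echo "8bit"        # Ada: only 8bit is listed
  else
    echo "none"        # older architectures are not listed in the note
  fi
}
```

On a machine with NVIDIA drivers, the input value could come from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, assuming a driver recent enough to expose that field.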

### Example: `deepseek-ai/DeepSeek-R1` with 4-Bit Online Quantization on NVIDIA H200 GPUs

```sh theme={null}
# GPU Info: NVIDIA H200 * 4
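# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0,1,2,3"')
export POLICY_DIR=""  # Host directory mounted at /policy below
# HF_HOME must also point to your host Hugging Face cache directory.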

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v $HF_HOME:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name deepseek-ai/DeepSeek-R1 \
    --quantization 4bit \
    --algo-policy-dir /policy \
    --search-policy true
```

<Tip>
  To serve online-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing-inference-with-policy-search#running-policy-search).
</Tip>

## Serving a Pre-Quantized Model

If you have already quantized a model and uploaded it to the Hugging Face Hub, Friendli Container can serve it, provided it falls into one of the following categories:

* **Models quantized with well-known schemes:**
  * [MXFP4](https://huggingface.co/docs/transformers/en/quantization/mxfp4)
  * [Fine-grained FP8](https://huggingface.co/docs/transformers/quantization/finegrained_fp8) (including DeepSeek-V3 style FP8 quantization)
  * A subset of models created by:
    * [AWQ](https://huggingface.co/docs/transformers/en/quantization/awq)
    * [AutoFP8](https://github.com/neuralmagic/AutoFP8)
    * [compressed-tensors](https://huggingface.co/docs/transformers/en/quantization/compressed_tensors)
* [**Quantized model checkpoints by FriendliAI**](https://huggingface.co/FriendliAI)
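
Before serving, it can help to check which scheme a checkpoint declares. Checkpoints saved through Hugging Face Transformers typically record this under `quantization_config.quant_method` in `config.json`. The sketch below prints that field from a local copy of the file. `quant_method` here is a hypothetical helper that uses a quick text match rather than a full JSON parser, and whether Friendli Container itself reads this field is not stated in this guide.

```sh
# Print the quant_method declared in a local Hugging Face config.json.
# $1: path to config.json. Prints "not declared" when the field is absent.
# Hypothetical helper: a simple text match, not a full JSON parser.
quant_method() {
  method=$(grep -o '"quant_method"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/')
  echo "${method:-not declared}"
}
```

For example, `quant_method ./config.json` run against a downloaded checkpoint prints the declared method name, which can then be compared with the list above.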

### Example: `openai/gpt-oss-120b` on NVIDIA B200 GPU

```sh theme={null}
# GPU Info: NVIDIA B200
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0"')
export POLICY_DIR=""  # Host directory mounted at /policy below

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name openai/gpt-oss-120b \
    --algo-policy-dir /policy \
    --search-policy true
```

<Tip>
  To serve pre-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing-inference-with-policy-search#running-policy-search).
</Tip>
