# QuickStart: Friendli Container Trial

> Get started with Friendli Container trial. Access the registry, configure your secret, launch the container, and monitor with Grafana.

export const RoundedBorderBox = ({children, caption}) => <div className="rounded-border-box">
    {children}
    {caption && <p className="text-sm text-gray-700 dark:text-gray-400">{caption}</p>}
  </div>;

Get started with [Friendli Container](/guides/container/introduction). This quickstart walks you through running your first container—from trial access to your first inference request—and serving an LLM in a secure, private environment.

For detailed launch options, multi-GPU serving, and the full option reference, see [Configuration](/guides/container/configuration).

## Prerequisites

* **Hardware Requirements**: Friendli Container currently targets only the x86\_64 architecture and supports NVIDIA GPUs. Prepare suitable GPUs and a compatible driver by following the [CUDA compatibility guide](/guides/container/cuda-compatibility).
* **Software Requirements**: Your machine must be able to run containers with the [NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html). This tutorial uses Docker as the container runtime, together with [Docker Compose](https://docs.docker.com/compose).
* **Model Compatibility**: If your model is in [safetensors](https://huggingface.co/docs/safetensors/index) format and is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve it directly with Friendli Container. See our [Model library](https://friendli.ai/models) for a non-exhaustive list of supported models.

This tutorial assumes that your model of choice is uploaded to [Hugging Face](https://huggingface.co) and you have access to it. If the model is gated or private, you need to prepare a [Hugging Face Access Token](https://huggingface.co/settings/tokens).
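Before continuing, it can help to confirm that the command-line tools this tutorial relies on are installed. The sketch below is a hypothetical helper, not part of Friendli Container. It assumes the tools are reachable on `PATH` under the names `docker`, `nvidia-smi`, and `git`, and it does not verify the NVIDIA container toolkit, the Docker Compose plugin, or driver compatibility.

```python
import shutil

# Assumption: the tutorial's tooling is installed under these executable names.
REQUIRED_TOOLS = ("docker", "nvidia-smi", "git")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` on a prepared machine should return an empty list. Any name it returns points at a prerequisite to install first.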

## Getting Access to Friendli Container

### Activate Your Free Trial

[Contact sales](https://friendli.ai/contact) to activate your free trial.

### Get Access to the Container Registry

You need a Personal API key to log in to the container registry.

1. Go to [Friendli Suite > Personal Settings > API Keys](https://friendli.ai/suite/~/setting/keys) and click 'Create API Key'.
2. Save the API key you just created.

### Prepare Your Container Secret

A container secret is a code that activates Friendli Container. You pass the container secret as an environment variable when running the container image.

1. Go to [Friendli Suite > Container > Container Secrets](https://friendli.ai/suite/~/container/secrets) and click 'Create secret'.
2. Save the secret you just created.

<Note>
  **🔑 Secret Rotation**

  You can rotate the container secret for security reasons. If you rotate the container secret, a new secret will be created and the previous secret will be automatically revoked in **30** minutes.
</Note>
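As a small worked example of the rotation window, the sketch below computes when the previous secret stops working, assuming the 30 minutes are counted from the moment of rotation. The function name is hypothetical and nothing here calls a Friendli API.

```python
from datetime import datetime, timedelta, timezone

# The grace period stated in the note above.
ROTATION_GRACE_PERIOD = timedelta(minutes=30)


def previous_secret_revoked_at(rotated_at):
    """Return the time at which the pre-rotation secret is revoked.

    Assumption: the grace period starts at `rotated_at`.
    """
    return rotated_at + ROTATION_GRACE_PERIOD
```

For instance, rotating at 12:00 UTC gives running containers until 12:30 UTC to be restarted with the new secret.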

## Running Friendli Container

### Pull the Friendli Container Image

1. Log in to the container registry using the email address for your Friendli Suite account and the Personal API key.

   ```sh theme={null}
   export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS"
   export API_KEY="YOUR_API_KEY"
   docker login registry.friendli.ai -u "$FRIENDLI_EMAIL" -p "$API_KEY"
   ```

2. Pull the image.

   ```sh theme={null}
   docker pull registry.friendli.ai/trial
   ```

### Run Friendli Container with a Hugging Face Model

1. Clone our [container resource](https://github.com/friendliai/container-resource) git repository.

   ```sh theme={null}
   git clone https://github.com/friendliai/container-resource
   cd container-resource/quickstart/docker-compose
   ```

2. Set up environment variables.

   ```sh theme={null}
   export HF_MODEL_NAME="<...>"  # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
   export FRIENDLI_CONTAINER_SECRET="<...>"  # Friendli container secret
   ```

   If your model is private or gated, you also need to provide a [Hugging Face Access Token](https://huggingface.co/settings/tokens).

   ```sh theme={null}
   export HF_TOKEN="<...>"  # Hugging Face Access Token
   ```

3. Launch the Friendli Container.

   ```sh theme={null}
   docker compose up -d
   ```

<Note>
  By default, the container listens for inference requests on TCP port 8000, and a Grafana service is available on TCP port 3000. To change these ports, set the following environment variables before launching the container. For example, to serve inference requests on port 8001 and Grafana on port 3001, run the commands below.

  ```sh theme={null}
  export FRIENDLI_PORT="8001"
  export FRIENDLI_GRAFANA_PORT="3001"
  ```
</Note>
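If you script against the container, the same two variables can be used to derive the URLs that the rest of this tutorial refers to. This is a minimal sketch: the function name and the `host` parameter are hypothetical, and it assumes the script runs in a shell where the variables were exported (falling back to the defaults 8000 and 3000 otherwise).

```python
import os


def service_urls(env=os.environ, host="localhost"):
    """Build the inference and Grafana URLs from the port variables.

    Falls back to the documented default ports when a variable is unset.
    """
    port = env.get("FRIENDLI_PORT", "8000")
    grafana_port = env.get("FRIENDLI_GRAFANA_PORT", "3000")
    return {
        "inference": f"http://{host}:{port}/v1",
        "grafana": f"http://{host}:{grafana_port}/d/friendli-engine",
    }
```

With the example values above, `service_urls()["inference"]` is `http://localhost:8001/v1`.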

<Note>
  Even if the machine has multiple GPUs, the container uses only one GPU, specifically the first one (`device_ids: ['0']`). To change which GPU the container uses, edit `docker-compose.yaml`.
</Note>
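To decide which index to put in `device_ids`, you can list the GPUs that the driver reports. The sketch below wraps `nvidia-smi --query-gpu=index,name --format=csv,noheader`. The helper names are hypothetical, and it assumes that the indices printed by `nvidia-smi` are the ones Docker expects, which is the usual case but worth confirming on your machine.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse `index, name` lines into a list of (index, name) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        index, name = (field.strip() for field in line.split(",", 1))
        gpus.append((index, name))
    return gpus


def list_gpus():
    """Ask nvidia-smi for the index and name of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_list(result.stdout)
```

Each tuple's first element is the string you would place in `device_ids`, for example `'1'` for the second GPU.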

<Note>
  The downloaded Hugging Face model will be cached in the `$HOME/.cache/huggingface` directory. You may want to clean up this directory after completing this tutorial.
</Note>
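`docker compose up -d` returns as soon as the containers are started, and the first launch also has to download the model, so the server may not answer right away. The sketch below polls a URL until something speaking HTTP responds. It is an illustration with hypothetical function names: it treats any HTTP status (even an error such as 404) as a sign that the server process is up, and it does not rely on any particular health endpoint. Whether the model has finished loading at that point is an assumption you should verify with a real request.

```python
import http.client
import time
import urllib.error
import urllib.request


def server_is_answering(url, timeout=2.0):
    """Return True if an HTTP server responds at `url` with any status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server replied, just not with a 2xx status
    except (OSError, http.client.HTTPException):
        return False  # refused, reset, timed out, or not valid HTTP


def wait_until_answering(url, timeout=600.0, interval=5.0):
    """Poll `url` every `interval` seconds until it answers or `timeout` passes."""
    deadline = time.monotonic() + timeout
    while not server_is_answering(url):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

For example, `wait_until_answering("http://localhost:8000/")` blocks for up to ten minutes and returns `False` if nothing answered in that time.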

### Send Inference Requests

You can now send inference requests to the running container. For information on all available parameters, refer to the [API reference](/openapi).

<CodeGroup>
  ```sh curl theme={null}
  curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "What makes a good leader?"}
      ],
      "max_tokens": 30
    }'
  ```

  ```python OpenAI Python SDK theme={null}
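  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8000/v1",
      # Placeholder: the OpenAI client requires an API key to be set, but the
      # requests in this guide send no credentials to the container.
      api_key="EMPTY",
  )

  completion = client.chat.completions.create(
      model="",
      messages=[
          {"role": "user", "content": "What makes a good leader?"}
      ],
      max_tokens=30,
      stream=True
  )
  for chunk in completion:
      # Some chunks (such as the final one) carry no text; skip them.
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="", flush=True)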
  ```

  ```python Friendli Python SDK theme={null}
  from friendli import SyncFriendli

  client = SyncFriendli()

  stream = client.container.chat.complete(
      messages=[{"role": "user", "content": "Python is a popular"}],
      max_tokens=30,
      stream=True,
  )
  for chunk in stream:
      print(chunk.text, end="", flush=True)
  ```

  ```sh Completion theme={null}
  curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "What makes a good leader?",
      "max_tokens": 30
    }'
  ```

  ```sh Tokenization theme={null}
  curl -X POST http://localhost:8000/v1/tokenize \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "What is generative AI?"
    }'
  ```

  ```sh Detokenization theme={null}
  curl -X POST http://localhost:8000/v1/detokenize \
    -H "Content-Type: application/json" \
    -d '{
      "tokens": [
        128000,
        3923,
        374,
        1803,
        1413,
        15592,
        30
      ]
    }'
  ```
</CodeGroup>
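If you prefer not to install an SDK, the chat request from the curl example can also be sent with Python's standard library. This is a minimal non-streaming sketch with hypothetical helper names. It assumes the response follows the OpenAI-style layout (`choices[0].message.content`), which is what the OpenAI SDK example relies on, and that the container is reachable at `http://localhost:8000`.

```python
import json
import urllib.request


def build_chat_request(prompt, max_tokens=30):
    """Build the same JSON body as the curl chat completion example."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response_body):
    """Pull the generated text out of a parsed response.

    Assumption: the body uses the OpenAI-style `choices` layout.
    """
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8000/v1", max_tokens=30):
    """POST a single chat completion request and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the container running, `print(chat("What makes a good leader?"))` prints the model's answer in one piece instead of streaming it.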

<Note>
  Chat completion requests work only if the model's tokenizer config contains a `chat_template`.
</Note>
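One way to check this ahead of time is to inspect the model's tokenizer configuration. The sketch below assumes you have a local copy of the model's `tokenizer_config.json` (the path you pass in is yours to supply) and that the template is stored under the `chat_template` key of that file. Models that keep their template elsewhere would need a different check.

```python
import json
from pathlib import Path


def has_chat_template(tokenizer_config_path):
    """Return True if the tokenizer config defines a non-empty `chat_template`."""
    config = json.loads(Path(tokenizer_config_path).read_text(encoding="utf-8"))
    return bool(config.get("chat_template"))
```

If this returns `False`, the completion endpoint (`/v1/completions`) is the one to use for that model.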

### Monitor with Grafana

Using your browser, open `http://localhost:3000/d/friendli-engine` and log in with username `admin` and password `admin`. The dashboards show useful engine metrics.

<RoundedBorderBox>
  <img alt="Grafana Dashboard" src="https://mintcdn.com/friendliai/SRK7vx0X1v_2rjkU/static/images/guides/container/grafana-template-dashboard-example.png?fit=max&auto=format&n=SRK7vx0X1v_2rjkU&q=85&s=16f13d2b171b076f6b9b7f37115ea63d" width="6016" height="3078" data-path="static/images/guides/container/grafana-template-dashboard-example.png" />
</RoundedBorderBox>

<Note>
  If you cannot open a browser directly on the GPU machine where Friendli Container is running, you can use SSH to forward requests from the browser on your PC to the GPU machine.

  ```sh theme={null}
  # Change these variables to match your environment.
  GPU_MACHINE_ADDRESS="<...>"  # Address of the GPU machine.
  LOCAL_GRAFANA_PORT=3000  # Port to use on your PC.
  FRIENDLI_GRAFANA_PORT=3000  # Grafana port on the GPU machine.

  ssh "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:localhost:$FRIENDLI_GRAFANA_PORT"
  ```

  Set `GPU_MACHINE_ADDRESS` to the address of the GPU machine. You can also pass the `-l login_name` or `-p port` options to `ssh` if your connection requires them.

  Then, in the browser on your PC, open `http://localhost:$LOCAL_GRAFANA_PORT/d/friendli-engine`, substituting the port number you chose.
</Note>

## Going Further

Congratulations! You can now serve your LLM of choice on your own hardware with Friendli Container. The following topics cover what to explore next.

* **Multi-GPU Serving**: Although this tutorial is limited to using only one GPU, Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference. Check [Multi-GPU Serving](/guides/container/configuration#multi-gpu-serving) for more information.
* **Serving Multi-LoRA Models**: You can deploy multiple customized LLMs without additional GPU resources. See [Serving Multi-LoRA Models](/guides/container/serving-multi-lora-models) to learn how to launch the container with your adapters.
* **Quantization**: Friendli Container supports **online quantization**, which quantizes a model when you launch it. You can also serve a pre-quantized model. Check [Quantization](/guides/container/quantization) for more information.
* **Serving MoE Models**: Running MoE (Mixture of Experts) models requires an additional step of [execution policy search](/guides/container/optimizing-inference-with-policy-search). See [Serving MoE Models](/guides/container/serving-moe-models) to learn how to launch the container with MoE models.

If you get stuck or need help with this tutorial, email [Support](mailto:support@friendli.ai).