# Endpoints

> Configure, monitor, and manage Friendli Dedicated Endpoints. Learn about endpoint configuration options, statuses, and the metrics you can monitor.

export const RoundedBorderBox = ({children, caption}) => <div className="rounded-border-box">
    {children}
    {caption && <p className="text-sm text-gray-700 dark:text-gray-400">{caption}</p>}
  </div>;

An endpoint is a running deployment of your model on dedicated GPUs. This page covers what you can configure on an endpoint and what you can monitor.

## What You Can Configure

When you create or update an endpoint, you can configure the following options:

* **Name**: The name of the endpoint.
* **Model**: The model to serve, from **Hugging Face**{/*, **Weights & Biases**,*/} or [uploaded models](/guides/dedicated-endpoints/models), with optional [LoRA adapters](/guides/dedicated-endpoints/lora-serving).
* **Instance type**: The GPU type and count for the endpoint.
* **[Scaling configuration](/guides/dedicated-endpoints/autoscaling)**: The minimum and maximum number of replicas between which the endpoint scales with traffic.
* **[Online Quantization](/guides/dedicated-endpoints/online-quantization)**: Improves serving efficiency with FriendliAI's proprietary quantization method. Choose off, 8-bit, or 4-bit.
* **[Speculative Decoding](/guides/dedicated-endpoints/speculative-decoding)**: Speeds up generation by drafting candidate tokens and verifying them in parallel, using a draft model or n-gram speculation.
* **Host KV Cache**: Additional host memory for KV cache storage, extending total KV capacity beyond GPU memory limits. Enabling it may increase startup time.
* **Engine configuration**: Special token handling and maximum batch size.
* **Request logging**: Whether to log request content (default: off).
* **[Reasoning Parser](/guides/reasoning#reasoning-parser)**: The default `parse_reasoning` behavior, applied when the argument isn't provided in a request.
* **Custom Chat Template**: A [Jinja](https://jinja.palletsprojects.com/en/stable/) template that overrides the model's default template.
* **[Version comment](/guides/dedicated-endpoints/versioning)**: An optional note describing each deployed version of the endpoint.
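
The **Reasoning Parser** option acts as a fallback: it only applies when a request does not set `parse_reasoning` itself. The sketch below models that precedence in plain Python. The function name `resolve_parse_reasoning`, the boolean type, and the choice to treat an explicit `null` as "not provided" are assumptions made for illustration, not the platform's server-side implementation.

```python
from typing import Any, Mapping


def resolve_parse_reasoning(
    request_body: Mapping[str, Any],
    endpoint_default: bool,
) -> bool:
    """Return the effective parse_reasoning value for one request.

    Hypothetical helper: it illustrates the documented fallback only.
    """
    value = request_body.get("parse_reasoning")
    # Assumption: a missing key and an explicit null both mean "not provided",
    # so the endpoint-level default applies.
    if value is None:
        return endpoint_default
    # An explicit value in the request wins, even when it is False.
    return bool(value)
```

With an endpoint default of `True`, a request body of `{"parse_reasoning": False}` resolves to `False`, while a request body that omits the argument resolves to `True`.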

## What You Can Monitor

For each endpoint, you can monitor the following:

### Status

An endpoint moves through the following statuses:

* **Initializing**: The endpoint is starting up after creation. Startup has three phases: initializing the GPU, downloading the model, and initializing the engine.
* **Running**: At least one replica is available to serve requests.
* **Updating**: A change to the endpoint's spec is being applied.
* **Sleeping**: The endpoint has released its GPUs after receiving no requests for the cooldown period.
* **Waking up**: The endpoint is returning from Sleeping to Running, which involves the same work as Initializing. You can trigger a wake-up manually or by sending a request.
* **Terminated**: The endpoint has been terminated.
* **Failed**: The endpoint failed to initialize.
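
A client that needs the endpoint to be ready before sending traffic can poll its status and stop once it reaches **Running**. The sketch below is a minimal example of that pattern. The `EndpointStatus` enum mirrors the status names listed above, but `get_status` is a hypothetical callable that you would back with whatever status lookup you use. The timeout and poll interval are arbitrary illustrative values.

```python
import time
from enum import Enum
from typing import Callable


class EndpointStatus(Enum):
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    UPDATING = "Updating"
    SLEEPING = "Sleeping"
    WAKING_UP = "Waking up"
    TERMINATED = "Terminated"
    FAILED = "Failed"


# Statuses that waiting alone will never turn into Running.
TERMINAL_STATUSES = {EndpointStatus.TERMINATED, EndpointStatus.FAILED}


def wait_until_running(
    get_status: Callable[[], EndpointStatus],  # hypothetical status lookup
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until the endpoint reports Running, or raise on failure/timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status is EndpointStatus.RUNNING:
            return
        if status in TERMINAL_STATUSES:
            raise RuntimeError(f"endpoint is {status.value}; it will not start")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status.value} after {timeout_s}s")
        sleep(poll_interval_s)
```

Note that a **Sleeping** endpoint stays asleep until something triggers a wake-up, so this loop would time out unless a wake-up is triggered manually or by a request.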

### Versions

Each endpoint keeps a [deployment history](/guides/dedicated-endpoints/versioning), where every version captures a snapshot of its configuration with a comment. You can compare changes between versions and roll back to a previous one without downtime.

### Metrics

The **Metrics** tab provides charts for monitoring performance and usage:

* Processed requests
* Processed tokens
* Time to first token
* Time per output token
* Request latency
* Number of replicas
* Cost per million tokens
* Overall traffic (2xx, 4xx, and 5xx responses)

<Note>
  Some charts may not be available depending on the model type.
</Note>
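
To make the latency and cost charts concrete, the sketch below computes the same kinds of quantities from raw timestamps for a single request. It uses the commonly used definitions of these metrics; the exact definitions and aggregation behind the charts may differ. `RequestTiming` and the function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestTiming:
    sent_at: float                # seconds: when the request was sent
    token_times: Sequence[float]  # seconds: arrival time of each output token


def time_to_first_token(t: RequestTiming) -> float:
    """Delay between sending the request and receiving the first token."""
    if not t.token_times:
        raise ValueError("no output tokens recorded")
    return t.token_times[0] - t.sent_at


def time_per_output_token(t: RequestTiming) -> float:
    """Average gap between consecutive output tokens after the first one."""
    n = len(t.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def request_latency(t: RequestTiming) -> float:
    """Delay between sending the request and receiving the last token."""
    if not t.token_times:
        raise ValueError("no output tokens recorded")
    return t.token_times[-1] - t.sent_at


def cost_per_million_tokens(total_cost: float, total_tokens: int) -> float:
    """Normalize a cost over a period to a per-million-token rate."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost * 1_000_000 / total_tokens
```

Under these definitions, request latency equals time to first token plus the time per output token multiplied by the number of output tokens minus one.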

### KV Cache Size

You can see the endpoint's current KV cache size. To make it larger, enable the **Host KV Cache** option, which extends total KV capacity beyond GPU memory limits.
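
As a rough illustration of why extra host memory raises capacity, the sketch below estimates how many tokens fit in a KV cache using the standard per-token size for a transformer with standard (multi-head or grouped-query) attention. All names and numbers here are illustrative assumptions. The size the engine actually reports depends on its own memory layout, cache precision, and allocation strategy, none of which this sketch models.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Bytes of KV cache needed to hold one token."""
    # Factor of 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_capacity_tokens(
    gpu_kv_bytes: int,
    host_kv_bytes: int,
    bytes_per_token: int,
) -> int:
    """Tokens that fit when GPU and host KV memory are counted together."""
    return (gpu_kv_bytes + host_kv_bytes) // bytes_per_token


GIB = 1024**3

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

gpu_only = kv_capacity_tokens(16 * GIB, 0, per_token)
with_host = kv_capacity_tokens(16 * GIB, 32 * GIB, per_token)
```

In this hypothetical setup each token needs 128 KiB, so 16 GiB of GPU memory holds 131,072 tokens, and adding 32 GiB of host memory triples that to 393,216 tokens.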