An endpoint is a running deployment of your model on a dedicated GPU. This page covers what you can configure on an endpoint and what you can monitor.

What You Can Configure

When you create or update an endpoint, you can configure the following options:
  • Name: The name of the endpoint.
  • Model: The model to serve, from Hugging Face or uploaded models, with optional LoRA adapters.
  • Instance type: The GPU type and count for the endpoint.
  • Scaling configuration: The range of replicas used to scale with traffic.
  • Online Quantization: Improves serving efficiency with FriendliAI’s proprietary quantization method. Choose off, 8-bit, or 4-bit.
  • Speculative Decoding: Speeds up generation by drafting candidate tokens and verifying them in parallel, using a draft model or n-gram speculation.
  • Host KV Cache: Additional host memory for KV cache storage, extending total KV capacity beyond GPU memory limits (may add to startup time).
  • Engine configuration: Special token handling and maximum batch size.
  • Request logging: Whether to log request content (default: off).
  • Reasoning Parser: The default parse_reasoning behavior, applied when the argument isn’t provided in a request.
  • Custom Chat Template: A Jinja template that overrides the model’s default template.
  • Version comment: An optional note describing each deployed version of the endpoint.
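
To make the shape of these options concrete, here is a minimal sketch that models a subset of them as a Python dataclass. It is an illustration only, not FriendliAI's API: the class name, field names, `resolve_parse_reasoning` helper, and the replica validation rule are all hypothetical. The only values taken from this page are the three quantization choices, the request logging default (off), and the rule that the Reasoning Parser setting applies when a request omits the argument.

```python
from dataclasses import dataclass
from typing import Optional

# Quantization choices listed on this page.
QUANTIZATION_CHOICES = ("off", "8-bit", "4-bit")


@dataclass
class EndpointConfig:
    """Hypothetical container for a subset of the endpoint options."""
    name: str
    model: str
    instance_type: str
    min_replicas: int                 # hypothetical name for the low end of the replica range
    max_replicas: int                 # hypothetical name for the high end of the replica range
    online_quantization: str          # one of QUANTIZATION_CHOICES
    default_parse_reasoning: bool     # endpoint-level Reasoning Parser default
    request_logging: bool = False     # this page states the default is off
    version_comment: Optional[str] = None

    def __post_init__(self) -> None:
        if self.online_quantization not in QUANTIZATION_CHOICES:
            raise ValueError(f"unknown quantization: {self.online_quantization!r}")
        # Assumed rule: the replica range must be ordered and non-negative.
        if not 0 <= self.min_replicas <= self.max_replicas:
            raise ValueError("replica range must satisfy 0 <= min <= max")


def resolve_parse_reasoning(config: EndpointConfig,
                            request_value: Optional[bool] = None) -> bool:
    """Use the request's value when provided, else the endpoint default."""
    if request_value is None:
        return config.default_parse_reasoning
    return request_value
```

For example, with `default_parse_reasoning=True`, a request that omits the argument resolves to `True`, while a request that passes `False` explicitly resolves to `False`.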

What You Can Monitor

For each endpoint, you can monitor the following:

Status

An endpoint moves through the following statuses:
  • Initializing: The endpoint is starting up after creation. Startup has three phases: initializing the GPU, downloading the model, and initializing the engine.
  • Running: At least one replica is available to serve requests.
  • Updating: A change to the endpoint’s spec is being applied.
  • Sleeping: The endpoint released its GPUs after receiving no requests for the cooldown period.
  • Waking up: The endpoint is returning from Sleeping to Running, repeating the same startup work as Initializing. You can trigger a wake-up manually or by sending a request.
  • Terminated: The endpoint has been terminated.
  • Failed: The endpoint failed to initialize.
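
If you handle these statuses in client code, an enumeration keeps the names in one place. The sketch below is an assumption-laden illustration: the `EndpointStatus` enum and both helper functions are hypothetical, and the string values are simply the labels from the list above, not a documented API format. It encodes two facts from this page: Running means at least one replica can serve requests, and sending a request to a sleeping endpoint triggers a wake-up.

```python
from enum import Enum


class EndpointStatus(Enum):
    """Hypothetical enum using the status labels from this page."""
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    UPDATING = "Updating"
    SLEEPING = "Sleeping"
    WAKING_UP = "Waking up"
    TERMINATED = "Terminated"
    FAILED = "Failed"


def is_documented_as_serving(status: EndpointStatus) -> bool:
    """Only Running is described as having a replica available.

    Whether other statuses (for example Updating) also serve traffic
    is not stated on this page, so this helper stays conservative.
    """
    return status is EndpointStatus.RUNNING


def status_after_request(status: EndpointStatus) -> EndpointStatus:
    """A request sent to a sleeping endpoint triggers a wake-up.

    Every other status is returned unchanged, because this page does not
    describe any other request-driven transition.
    """
    if status is EndpointStatus.SLEEPING:
        return EndpointStatus.WAKING_UP
    return status
```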

Versions

Each endpoint keeps a deployment history, where every version captures a snapshot of its configuration with a comment. You can compare changes between versions and roll back to a previous one without downtime.
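
Comparing two versions amounts to diffing two configuration snapshots. The following sketch shows one way to do that with plain dictionaries; the `diff_versions` function and the snapshot keys are hypothetical and do not reflect how the product stores or compares versions.

```python
from typing import Any, Dict, Tuple


def diff_versions(old: Dict[str, Any],
                  new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for every key that differs.

    A key missing from one snapshot is reported with None on that side.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical snapshots of two deployed versions.
v1 = {"instance_type": "example-gpu", "max_replicas": 1}
v2 = {"instance_type": "example-gpu", "max_replicas": 2,
      "version_comment": "scale out"}

print(diff_versions(v1, v2))
# {'max_replicas': (1, 2), 'version_comment': (None, 'scale out')}
```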

Metrics

The Metrics tab provides charts for monitoring performance and usage:
  • Processed requests
  • Processed tokens
  • Time to first token
  • Time per output token
  • Request latency
  • Number of replicas
  • Cost per million tokens
  • Overall traffic (2xx, 4xx, and 5xx responses)
Some charts may not be available depending on the model type.
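
This page does not specify how the charts compute their values. As a rough guide to reading them, the sketch below uses common definitions of three of the metrics; the function names, parameters, and formulas are assumptions, and the dashboard may aggregate or define them differently.

```python
from typing import Optional


def time_to_first_token(request_sent_at: float, first_token_at: float) -> float:
    """Seconds between sending the request and receiving the first token."""
    return first_token_at - request_sent_at


def time_per_output_token(first_token_at: float,
                          last_token_at: float,
                          output_tokens: int) -> Optional[float]:
    """Average seconds between consecutive output tokens after the first.

    Returns None when fewer than two tokens were generated, because there
    is no inter-token interval to average.
    """
    if output_tokens < 2:
        return None
    return (last_token_at - first_token_at) / (output_tokens - 1)


def cost_per_million_tokens(total_cost: float, total_tokens: int) -> float:
    """Scale a cost over some number of tokens to a per-million-token rate."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost * 1_000_000 / total_tokens
```

Under these definitions, a request sent at t = 10.0 s whose first token arrives at 10.5 s and whose fifth and last token arrives at 12.5 s has a time to first token of 0.5 s and a time per output token of 0.5 s.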

KV Cache Size

You can see the endpoint’s current KV cache size. To make it larger, enable the Host KV Cache option, which extends total KV capacity beyond GPU memory limits.
Last modified on June 24, 2026