What You Can Configure
When you create or update an endpoint, you can configure the following options:
- Name: The name of the endpoint.
- Model: The model to serve, from Hugging Face or uploaded models, with optional LoRA adapters.
- Instance type: The GPU type and count for the endpoint.
- Scaling configuration: The range of replicas used to scale with traffic.
- Online Quantization: Improves serving efficiency with FriendliAI’s proprietary quantization method. Choose off, 8-bit, or 4-bit.
- Speculative Decoding: Speeds up generation by drafting candidate tokens and verifying them in parallel, using a draft model or n-gram speculation.
- Host KV Cache: Additional host memory for KV cache storage, extending total KV capacity beyond GPU memory limits (may add to startup time).
- Engine configuration: Special token handling and maximum batch size.
- Request logging: Whether to log request content (default: off).
- Reasoning Parser: The default parse_reasoning behavior, applied when the argument isn’t provided in a request.
- Custom Chat Template: A Jinja template that overrides the model’s default template.
- Version comment: An optional note describing each deployed version of the endpoint.
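To make the options above concrete, here is a minimal sketch of how they might be grouped and sanity-checked before submitting a create or update. The `EndpointConfig` class, its field names, the quantization labels, and `resolve_parse_reasoning` are all hypothetical and do not reflect the actual API schema. Only the defaults stated in the list (request logging off, the Reasoning Parser default applying when a request omits the argument) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the three Online Quantization choices.
QUANTIZATION_MODES = ("off", "8bit", "4bit")


@dataclass
class EndpointConfig:
    """Illustrative container for endpoint options; not the real schema."""
    name: str
    model: str
    instance_type: str
    gpu_count: int = 1
    min_replicas: int = 0              # assumed lower bound of the scaling range
    max_replicas: int = 1              # assumed upper bound of the scaling range
    online_quantization: str = "off"
    request_logging: bool = False      # off by default, as stated in the list
    default_parse_reasoning: bool = False
    chat_template: Optional[str] = None
    version_comment: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        if not self.name:
            problems.append("name must not be empty")
        if self.gpu_count < 1:
            problems.append("gpu_count must be at least 1")
        if not 0 <= self.min_replicas <= self.max_replicas:
            problems.append("replica range must satisfy 0 <= min <= max")
        if self.online_quantization not in QUANTIZATION_MODES:
            problems.append("online_quantization must be off, 8bit, or 4bit")
        return problems


def resolve_parse_reasoning(request_value: Optional[bool],
                            config: EndpointConfig) -> bool:
    """Use the request's value if given; otherwise fall back to the endpoint default."""
    if request_value is None:
        return config.default_parse_reasoning
    return request_value
```

The fallback in `resolve_parse_reasoning` mirrors the Reasoning Parser description: the endpoint-level setting only matters when a request leaves the argument out.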
What You Can Monitor
For each endpoint, you can monitor the following:
Status
An endpoint moves through the following statuses:
- Initializing: The endpoint is starting up after creation, in three phases: initializing the GPU, downloading the model, and initializing the engine.
- Running: At least one replica is available to serve requests.
- Updating: A change to the endpoint’s spec is being applied.
- Sleeping: The endpoint released its GPUs after receiving no requests for the cooldown period.
- Waking up: The endpoint is returning from Sleeping to Running, repeating the same phases as Initializing. You can trigger a wake-up manually or by sending a request.
- Terminated: The endpoint has been terminated.
- Failed: The endpoint failed to initialize.
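The status descriptions above can be read as a small state machine. The sketch below is one possible interpretation: the transition table is inferred from the wording of the list, and the ability to terminate from any non-terminated status is an assumption rather than documented behavior. `Status`, `TRANSITIONS`, `can_transition`, and `on_request` are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    UPDATING = "Updating"
    SLEEPING = "Sleeping"
    WAKING_UP = "Waking up"
    TERMINATED = "Terminated"
    FAILED = "Failed"


# Transitions inferred from the status descriptions. Moving to TERMINATED
# from any other status is an assumption, not something the list states.
TRANSITIONS = {
    Status.INITIALIZING: {Status.RUNNING, Status.FAILED, Status.TERMINATED},
    Status.RUNNING: {Status.UPDATING, Status.SLEEPING, Status.TERMINATED},
    Status.UPDATING: {Status.RUNNING, Status.TERMINATED},
    Status.SLEEPING: {Status.WAKING_UP, Status.TERMINATED},
    Status.WAKING_UP: {Status.RUNNING, Status.TERMINATED},
    Status.FAILED: {Status.TERMINATED},
    Status.TERMINATED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Check whether the assumed table allows moving from current to target."""
    return target in TRANSITIONS[current]


def on_request(current: Status) -> Status:
    """A request to a sleeping endpoint triggers a wake-up; otherwise no change."""
    if current is Status.SLEEPING:
        return Status.WAKING_UP
    return current
```

Note that a sleeping endpoint does not jump straight to Running in this model: it passes through Waking up, which repeats the initialization phases.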
Versions
Each endpoint keeps a deployment history, where every version captures a snapshot of its configuration with a comment. You can compare changes between versions and roll back to a previous one without downtime.
Metrics
The Metrics tab provides charts for monitoring performance and usage:
- Processed requests
- Processed tokens
- Time to first token
- Time per output token
- Request latency
- Number of replicas
- Cost per million tokens
- Overall traffic (2xx, 4xx, and 5xx responses)
Some charts may not be available depending on the model type.
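As a rough guide to how several of these metrics relate, the sketch below uses two common approximations: request latency as time to first token plus time per output token for each remaining token, and cost per million tokens as total replica cost divided by token throughput. These formulas are assumptions for illustration and may differ from how the charts are actually computed. The function names and all input values are hypothetical.

```python
def estimate_request_latency(ttft_s: float, tpot_s: float,
                             output_tokens: int) -> float:
    """Approximate end-to-end latency in seconds.

    Assumes the first token arrives after ttft_s and each later token
    takes tpot_s, ignoring network and queueing overhead.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


def cost_per_million_tokens(hourly_cost_per_replica: float, replicas: int,
                            tokens_per_hour: float) -> float:
    """Approximate cost per one million processed tokens.

    Assumes cost scales linearly with replica count and that
    tokens_per_hour is the total across all replicas.
    """
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return hourly_cost_per_replica * replicas / tokens_per_hour * 1_000_000


# Example with made-up numbers: 200 ms to first token, 20 ms per token after.
latency = estimate_request_latency(ttft_s=0.2, tpot_s=0.02, output_tokens=101)
cost = cost_per_million_tokens(hourly_cost_per_replica=4.0, replicas=2,
                               tokens_per_hour=8_000_000)
print(f"estimated latency: {latency:.2f} s, cost per 1M tokens: {cost:.2f}")
```

Under these assumptions, adding replicas without a matching rise in traffic raises the cost per million tokens, which is one reason to read the replica and cost charts together.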