# Autoscaling

> Configure autoscaling for Friendli Dedicated Endpoints to automatically adjust GPU replicas based on traffic and latency thresholds.

export const RoundedBorderBox = ({children, caption}) => <div className="rounded-border-box">
    {children}
    {caption && <p className="text-sm text-gray-700 dark:text-gray-400">{caption}</p>}
  </div>;

Friendli Dedicated Endpoints support autoscaling, which automatically adjusts the number of GPU replicas to match your traffic patterns so that you can balance performance against cost.

<RoundedBorderBox>
  <img alt="Autoscaling Config" style={{ maxWidth: "600px", width: "-webkit-fill-available" }} src="https://mintcdn.com/friendliai/SRK7vx0X1v_2rjkU/static/images/guides/dedicated-endpoints/tutorial/autoscaling-config.png?fit=max&auto=format&n=SRK7vx0X1v_2rjkU&q=85&s=6485338094ff440a4e7b06dffdac5a45" width="1576" height="1058" data-path="static/images/guides/dedicated-endpoints/tutorial/autoscaling-config.png" />
</RoundedBorderBox>

## How Autoscaling Works

* **Minimum Replicas**: The lower bound on the number of active replicas.
  * When set to 0, the endpoint enters sleeping status during periods of inactivity, which minimizes costs.
  * When set to a value greater than 0, the endpoint keeps at least that number of replicas active at all times.
* **Maximum Replicas**: The upper limit on the number of replicas that can be created to handle increased traffic.
* **Cooldown Period**: The length of the inactivity window, in seconds. If no requests are received during this period, the endpoint transitions to sleeping status.
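The interaction between these three settings can be sketched in a few lines of Python. This is only an illustrative model, not the Friendli API or the platform's actual implementation: the `AutoscalingConfig` class and the `clamp_replicas` and `should_sleep` helpers are hypothetical names, and the sketch assumes that sleeping applies only when the minimum is 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalingConfig:
    """Hypothetical container for the three settings described above."""

    min_replicas: int
    max_replicas: int
    cooldown_seconds: int

    def __post_init__(self):
        # Basic sanity checks; the real service may validate differently.
        if self.min_replicas < 0:
            raise ValueError("min_replicas must be >= 0")
        if self.max_replicas < max(self.min_replicas, 1):
            raise ValueError("max_replicas must be >= max(min_replicas, 1)")
        if self.cooldown_seconds < 0:
            raise ValueError("cooldown_seconds must be >= 0")


def clamp_replicas(config: AutoscalingConfig, desired: int) -> int:
    """Keep a desired replica count within [min_replicas, max_replicas]."""
    return max(config.min_replicas, min(desired, config.max_replicas))


def should_sleep(config: AutoscalingConfig, idle_seconds: float) -> bool:
    """Assumption: an endpoint sleeps only if its minimum is 0 and it has
    received no requests for at least the cooldown period."""
    return config.min_replicas == 0 and idle_seconds >= config.cooldown_seconds


if __name__ == "__main__":
    scale_to_zero = AutoscalingConfig(min_replicas=0, max_replicas=4, cooldown_seconds=300)
    always_on = AutoscalingConfig(min_replicas=1, max_replicas=4, cooldown_seconds=300)

    print(clamp_replicas(scale_to_zero, 10))  # capped at max_replicas
    print(should_sleep(scale_to_zero, 300))   # idle for the full cooldown
    print(should_sleep(always_on, 10_000))    # minimum > 0, so never sleeps
```

In this model, a minimum of 0 trades a possible wake-up delay for lower idle cost, while a minimum above 0 keeps capacity available at all times.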

## Scaling Policies

<Danger>
  We strongly recommend the **Default** autoscaling type, which performs reliably for most workloads.
  Other configurations may cause performance degradation or unexpected charges if you do not fully understand your workload characteristics.
</Danger>

* **Default** (Recommended): The best choice for the majority of users. It requires no configuration, operates reliably across most workloads, and draws on our internal expertise to balance performance and cost.
* **Request count**: An advanced option for users who understand their workload characteristics in depth and need granular control over scaling behavior.
  * Because you define how many requests a single worker handles, costs are easier to predict.
  * This method can also serve as a foundation for your own custom autoscaling logic: you can change the threshold dynamically via an API to target custom metrics.
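To see why a per-worker request threshold makes capacity easier to reason about, consider the rough sketch below. It assumes the replica count is the number of concurrent requests divided by the per-worker threshold, rounded up and bounded by the minimum and maximum replicas. The function `replicas_for_load` and its parameters are hypothetical, and the platform's actual scaling rule may differ.

```python
def replicas_for_load(
    concurrent_requests: int,
    requests_per_worker: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Estimate replicas needed under an assumed request-count rule.

    Assumption: each worker handles up to `requests_per_worker` requests,
    so the need is the ceiling of the ratio, bounded by the replica limits.
    """
    if requests_per_worker <= 0:
        raise ValueError("requests_per_worker must be positive")
    if concurrent_requests < 0:
        raise ValueError("concurrent_requests must be non-negative")

    # Integer ceiling division avoids floating-point rounding.
    needed = (concurrent_requests + requests_per_worker - 1) // requests_per_worker
    return max(min_replicas, min(needed, max_replicas))


if __name__ == "__main__":
    # 45 concurrent requests at 10 per worker need 5 workers.
    print(replicas_for_load(45, 10, min_replicas=0, max_replicas=8))
    # Heavy load is capped by the maximum.
    print(replicas_for_load(200, 10, min_replicas=0, max_replicas=8))
    # No load still keeps the configured minimum.
    print(replicas_for_load(0, 10, min_replicas=1, max_replicas=8))
```

Under this assumption, multiplying the estimated replica count by your per-replica price gives an approximate upper bound on cost for a given traffic level.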

## Benefits of Autoscaling

* **Cost Optimization**: Pay only for the resources you need for your workload.
* **Performance Management**: Handle traffic spikes efficiently.
* **Resource Efficiency**: Maintain optimal resource utilization for your workload.
