GPUs and Pricing - FriendliAI Docs

Dedicated Endpoints offer flexible monthly billing based on actual usage.

Supported Instance Types

Contact sales for a discounted custom pricing plan for your enterprise.

How Does Billing Work for Dedicated Endpoints?

Billing is measured per GPU-second for each replica. An endpoint can run multiple replicas at once, each metered independently for the time it’s running. For each replica:

Billing starts once the replica has finished initializing. Startup work (provisioning GPUs, downloading the model, and initializing the engine) is not billed.
Once running, the replica is billed for as long as it stays up, even when it’s idle and receiving no traffic.
Billing stops as soon as the replica begins shutting down, such as during scale-down, sleep, or termination.

Scaling up therefore adds cost for each new replica from the moment it is running, and an endpoint that always keeps at least one replica running accrues charges continuously, even when it receives no traffic.
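The metering rules above can be illustrated with a small calculation. This is a minimal sketch, not FriendliAI's billing implementation: the `ReplicaRun` fields, the helper names, the number of GPUs per replica, and the `RATE` value are all hypothetical placeholders, and the rate is not a real price.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from decimal import Decimal


@dataclass(frozen=True)
class ReplicaRun:
    """Lifecycle timestamps for one replica (hypothetical field names)."""
    requested_at: datetime         # startup begins; not billed
    ready_at: datetime             # initialization finished; billing starts
    shutdown_started_at: datetime  # shutdown begins; billing stops


def billable_seconds(run: ReplicaRun) -> Decimal:
    """Seconds between the end of initialization and the start of shutdown."""
    seconds = (run.shutdown_started_at - run.ready_at).total_seconds()
    return Decimal(str(max(seconds, 0.0)))


def replica_cost(run: ReplicaRun, gpus: int, rate_per_gpu_second: Decimal) -> Decimal:
    """Cost of one replica: billable seconds x GPUs x per-GPU-second rate."""
    return billable_seconds(run) * gpus * rate_per_gpu_second


RATE = Decimal("0.001")  # placeholder rate per GPU-second, NOT a real price

start = datetime(2026, 1, 1, 12, 0, 0)
run = ReplicaRun(
    requested_at=start,
    ready_at=start + timedelta(minutes=5),              # 5 minutes of startup, unbilled
    shutdown_started_at=start + timedelta(minutes=65),  # 60 minutes running, idle or not
)

print("billable seconds:", billable_seconds(run))
print("cost:", replica_cost(run, gpus=1, rate_per_gpu_second=RATE))
```

In this example the five minutes of startup contribute nothing, while the full hour between readiness and shutdown is billed whether or not the replica served any requests.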

How Does Autoscaling Affect My Costs?

Autoscaling adjusts the number of replicas to match traffic, and each replica is metered by the same rules above. Your cost rises and falls with the number of running replicas—for example, running 2 replicas instead of 1 doubles your GPU cost.
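To see how replica count drives cost, the sketch below sums replica-seconds over a timeline of autoscaling states. It assumes every replica uses the same number of GPUs; the function name, the interval format, and the `RATE` value are hypothetical and chosen only for illustration.

```python
from decimal import Decimal
from typing import Iterable, Tuple

RATE = Decimal("0.001")  # placeholder rate per GPU-second, NOT a real price


def endpoint_cost(
    intervals: Iterable[Tuple[int, int]],
    gpus_per_replica: int,
    rate_per_gpu_second: Decimal,
) -> Decimal:
    """Sum cost over (duration_seconds, running_replicas) intervals."""
    replica_seconds = sum(duration * replicas for duration, replicas in intervals)
    return replica_seconds * gpus_per_replica * rate_per_gpu_second


# One hour at a fixed replica count.
one_replica = endpoint_cost([(3600, 1)], 1, RATE)
two_replicas = endpoint_cost([(3600, 2)], 1, RATE)

# One hour with a 10-minute burst handled by 3 replicas.
with_burst = endpoint_cost([(1800, 1), (600, 3), (1200, 1)], 1, RATE)

print(one_replica, two_replicas, with_burst)
```

Running two replicas for the whole hour costs exactly twice as much as one, while scaling to three replicas only during the burst adds cost only for those ten minutes.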

Best Practices for Cost Management

Monitor running endpoints: Regularly review which endpoints are running and how many replicas they use, so you don’t pay for capacity you don’t need.
Enable sleeping for endpoints with intermittent traffic: Set the minimum replica count to 0 so an endpoint sleeps when idle and wakes automatically on the next request.
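The effect of sleeping on an intermittently used endpoint can be estimated with a rough comparison. This sketch makes simplifying assumptions that are not from the documentation: a single replica is enough while busy, the endpoint sleeps as soon as traffic stops (any idle period before sleeping is ignored), and `RATE` is a placeholder rather than a real price.

```python
from decimal import Decimal

SECONDS_PER_DAY = 24 * 60 * 60
RATE = Decimal("0.001")  # placeholder rate per GPU-second, NOT a real price


def daily_cost(
    busy_seconds: int,
    sleep_enabled: bool,
    rate_per_gpu_second: Decimal,
    gpus_per_replica: int = 1,
) -> Decimal:
    """Estimated daily cost of a single-replica endpoint.

    With sleeping enabled (minimum replicas = 0), only busy time is billed.
    Without it, the replica is billed for the whole day, idle time included.
    """
    billed_seconds = busy_seconds if sleep_enabled else SECONDS_PER_DAY
    return billed_seconds * gpus_per_replica * rate_per_gpu_second


busy = 2 * 60 * 60  # endpoint serves traffic for 2 hours per day

always_on = daily_cost(busy, sleep_enabled=False, rate_per_gpu_second=RATE)
sleeping = daily_cost(busy, sleep_enabled=True, rate_per_gpu_second=RATE)

print("always on:", always_on)
print("with sleep:", sleeping)
print("saved per day:", always_on - sleeping)
```

The trade-off is that a sleeping endpoint has to start a replica again on the next request; under the billing rules above that startup time is not billed, but the request waits for it.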

Last modified on June 30, 2026
