Supported Instance Types
Contact sales for a discounted custom pricing plan for your enterprise.
How Does Billing Work for Dedicated Endpoints?
Billing is measured per GPU-second for each replica. An endpoint can run multiple replicas at once, each metered independently for the time it’s running. For each replica:
- Billing starts once the replica has finished initializing. The startup work (provisioning GPUs, downloading the model, and initializing the engine) is not billed.
- Once running, the replica is billed for as long as it stays up, even when it’s idle and receiving no traffic.
- Billing stops as soon as the replica begins shutting down, such as during scale-down, sleep, or termination.
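As a rough illustration of the metering rules above, the following sketch totals billed GPU-seconds across replicas. It is not an official billing calculator: the `ReplicaRun` type, its field names, the per-replica GPU count multiplier, and the price used in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaRun:
    """One replica's lifetime, in seconds on a shared clock (hypothetical type)."""
    ready_at: float             # initialization finished; billing starts here
    shutdown_started_at: float  # shutdown began; billing stops here
    gpus: int = 1               # assumed: cost scales with GPUs per replica


def billed_gpu_seconds(run: ReplicaRun) -> float:
    # Startup time before ready_at is excluded; idle time while up is included.
    return max(0.0, run.shutdown_started_at - run.ready_at) * run.gpus


def endpoint_cost(runs: list[ReplicaRun], price_per_gpu_second: float) -> float:
    # Each replica is metered independently, then the totals are summed.
    return sum(billed_gpu_seconds(r) for r in runs) * price_per_gpu_second


# Replica A is ready at t=120s (its startup is unbilled) and stops at t=3720s.
# Replica B is added by scale-up, ready at t=1800s, and stops at t=3600s.
runs = [ReplicaRun(120, 3720), ReplicaRun(1800, 3600)]
total_gpu_seconds = sum(billed_gpu_seconds(r) for r in runs)  # 3600 + 1800
cost = endpoint_cost(runs, price_per_gpu_second=0.001)  # hypothetical price
```

In this example replica A accounts for 3600 GPU-seconds and replica B for 1800, regardless of how much traffic either one served while running.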
How Does Autoscaling Affect My Costs?
Autoscaling adjusts the number of replicas to match traffic, and each replica is metered by the same rules above. Your cost rises and falls with the number of running replicas. For example, running 2 replicas instead of 1 doubles your GPU cost.
Best Practices for Cost Management
- Monitor running endpoints: Regularly review which endpoints are running and how many replicas they use, so you don’t pay for capacity you don’t need.
- Enable sleeping for endpoints with intermittent traffic: Set the minimum replica count to 0 so an endpoint sleeps when idle and wakes automatically on the next request.
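To get a feel for how much sleeping can save, the sketch below compares an always-on replica with one that scales to zero. It is a simplified estimate under stated assumptions: a single replica with one GPU, and a replica that goes to sleep as soon as traffic stops (any idle timeout before sleeping would add billed time). The function name and the traffic figure are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_billed_gpu_seconds(busy_seconds_per_day: int, min_replicas: int) -> int:
    """Estimate billed GPU-seconds per day for one single-GPU replica.

    Assumes the replica sleeps immediately when idle if min_replicas is 0,
    and stays up all day otherwise. Wake-up (startup) time is not billed.
    """
    if min_replicas == 0:
        return busy_seconds_per_day
    return SECONDS_PER_DAY


# Hypothetical workload: about 2 hours of traffic per day.
busy = 2 * 60 * 60
always_on = daily_billed_gpu_seconds(busy, min_replicas=1)
sleeping = daily_billed_gpu_seconds(busy, min_replicas=0)
savings_fraction = 1 - sleeping / always_on
```

The trade-off is latency: the first request after a sleep waits for the replica to start up again, so this setting suits workloads that can tolerate an occasional slow first response.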