Queueing Threshold
A queueing threshold defines the capacity at which an endpoint starts queueing requests. Friendli Dedicated Endpoints currently support one threshold type:
- Request count: The average number of in-flight requests each replica should handle. The endpoint multiplies this value by the current number of running replicas to get its capacity. Once the number of in-flight requests exceeds that capacity, the extra requests are queued rather than routed to a replica.
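The capacity arithmetic described above can be illustrated with a small sketch. The function names `queueing_capacity` and `queued_requests` are hypothetical and exist only for this illustration; they are not part of any Friendli API, and the sketch assumes the in-flight count covers every request the endpoint has accepted but not yet completed.

```python
def queueing_capacity(request_count_threshold: int, running_replicas: int) -> int:
    """Capacity = per-replica request count threshold x running replicas."""
    return request_count_threshold * running_replicas


def queued_requests(in_flight: int, request_count_threshold: int, running_replicas: int) -> int:
    """Number of requests that exceed capacity and would wait in the queue.

    Assumption: `in_flight` counts all outstanding requests on the endpoint.
    """
    capacity = queueing_capacity(request_count_threshold, running_replicas)
    return max(0, in_flight - capacity)


# Example: a threshold of 8 with 3 running replicas gives a capacity of 24,
# so 30 outstanding requests would leave 6 waiting in the queue.
print(queueing_capacity(8, 3))
print(queued_requests(30, 8, 3))
```

Because capacity scales with the number of running replicas, the same threshold admits more concurrent requests as replicas are added.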
The appropriate threshold depends on your model, GPU instance, and workload characteristics, so tune it to the point where the endpoint meets your target performance.
Queue Timeout
Queue timeout is optional and controls how long a request can stay in the queue:
- Not set: Queued requests wait until capacity frees up.
- Set: When a queued request has waited longer than the timeout, the endpoint returns a 429 Too Many Requests response, so the client can retry or fall back.
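On the client side, handling a 429 response could look roughly like the sketch below. It is deliberately transport-agnostic: `send` is a hypothetical callable, supplied by you, that performs one request and returns a `(status_code, body)` tuple. The retry count and exponential backoff delays are illustrative assumptions, not values recommended by Friendli.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` and retry with exponential backoff while it returns 429.

    `send` is a hypothetical callable that issues one request and returns
    (status_code, body). The last response is returned if every attempt
    ends in 429, so the caller can fall back to another path.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

If the returned status is still 429 after the final attempt, the caller can switch to a fallback, for example another endpoint or a degraded response.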
Set Up Request Queueing
Configure the Threshold and Timeout
Enter a Request Count Threshold (minimum 1) and, optionally, a Queue timeout in seconds. Leave the timeout empty or set it to 0 for No Limit.
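If you keep these settings in your own tooling before entering them, the input rules above can be mirrored in a small validation helper. This is only a sketch: `normalize_queue_settings` and its return shape are hypothetical, and rejecting negative timeouts is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple


def normalize_queue_settings(
    request_count_threshold: int,
    queue_timeout_seconds: Optional[int] = None,
) -> Tuple[int, Optional[int]]:
    """Validate the threshold and normalize the timeout.

    Returns (threshold, timeout), where a timeout of None means No Limit.
    """
    if request_count_threshold < 1:
        raise ValueError("Request Count Threshold must be at least 1")
    # An empty timeout or 0 both mean No Limit.
    if queue_timeout_seconds is None or queue_timeout_seconds == 0:
        return request_count_threshold, None
    if queue_timeout_seconds < 0:
        # Assumption: negative timeouts are treated as invalid input.
        raise ValueError("Queue timeout must not be negative")
    return request_count_threshold, queue_timeout_seconds
```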