Request Queueing - FriendliAI Docs

Queueing Threshold

A queueing threshold defines the capacity at which an endpoint starts queueing requests. Friendli Dedicated Endpoints currently support one threshold type:

Request count: The average number of in-flight requests each replica should handle. The endpoint multiplies this value by the current number of running replicas to get its capacity. Once the number of in-flight requests exceeds that capacity, the extra requests are queued rather than routed to a replica.
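
The capacity arithmetic described above can be sketched in a few lines. This is only an illustration of the stated rule, not Friendli's implementation; the function names `queue_capacity` and `queued_requests` and the parameter `outstanding_requests` are hypothetical.

```python
def queue_capacity(request_count_threshold: int, running_replicas: int) -> int:
    """Capacity = per-replica threshold multiplied by the number of running replicas."""
    return request_count_threshold * running_replicas


def queued_requests(outstanding_requests: int,
                    request_count_threshold: int,
                    running_replicas: int) -> int:
    """Number of requests that would wait in the queue under the stated rule.

    Only requests beyond the capacity are queued, so the result is never negative.
    """
    capacity = queue_capacity(request_count_threshold, running_replicas)
    return max(0, outstanding_requests - capacity)


if __name__ == "__main__":
    # Threshold of 8 with 3 running replicas gives a capacity of 24.
    print(queue_capacity(8, 3))
    # With 30 outstanding requests, 6 exceed the capacity and are queued.
    print(queued_requests(30, 8, 3))
```

Because the capacity scales with the number of running replicas, the same threshold admits more requests as replicas are added and fewer as they are removed.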

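The appropriate threshold depends on your model, GPU instance, and workload characteristics. A lower threshold queues requests sooner and keeps the load on each replica lighter; a higher threshold admits more concurrent requests per replica before queueing begins. Tune the value until the endpoint meets your target performance.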

Queue Timeout

Queue timeout is optional and controls how long a request can stay in the queue:

Not set: Queued requests wait until capacity frees up.

Set: When a queued request has waited longer than the timeout, the endpoint returns a 429 Too Many Requests response, so the client can retry or fall back.

Set a queue timeout when you would rather reject a request than let it wait too long.
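
On the client side, a 429 response caused by a queue timeout can be handled with a bounded retry. The following is a minimal sketch, assuming the caller supplies a `send` function that performs the request and returns a `(status, body)` pair; `call_with_retry`, `send`, and the backoff values are hypothetical and not part of any Friendli SDK.

```python
import time
from typing import Callable, Tuple


def call_with_retry(send: Callable[[], Tuple[int, str]],
                    max_attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it returns a status other than 429 or attempts run out.

    Waits base_delay, then 2x, then 4x, ... between attempts (exponential backoff).
    Returns the last (status, body) pair, which may still be a 429.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Do not sleep after the final attempt; just give up.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

If the final result is still a 429, the caller can fall back, for example by returning an error to the user or routing the request elsewhere.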

Set Up Request Queueing

Find the Endpoint Features Section

While creating or updating an endpoint, go to Endpoint Features.

Enable Request Queueing

Turn on Request Queueing.

Configure the Threshold and Timeout

Enter a Request Count Threshold (minimum 1) and, optionally, a Queue timeout in seconds. Leave the timeout empty or set it to 0 for No Limit.
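
The rules for these two fields can be expressed as a small validation helper, for example in a script that checks settings before they are entered. This is a sketch under assumptions: `normalize_queue_settings` and the dictionary keys are hypothetical names, and rejecting negative timeouts is an assumption not stated above.

```python
from typing import Optional


def normalize_queue_settings(request_count_threshold: int,
                             queue_timeout_seconds: Optional[float] = None) -> dict:
    """Validate the two queueing fields and normalize "no limit" to None.

    - The threshold must be at least 1.
    - An empty (None) or 0 timeout both mean "No Limit".
    """
    if request_count_threshold < 1:
        raise ValueError("Request Count Threshold must be at least 1")
    if queue_timeout_seconds is not None and queue_timeout_seconds < 0:
        # Assumption: negative timeouts are treated as invalid input.
        raise ValueError("Queue timeout cannot be negative")
    # Both None and 0 collapse to None, meaning queued requests wait indefinitely.
    timeout = queue_timeout_seconds or None
    return {
        "request_count_threshold": request_count_threshold,
        "queue_timeout_seconds": timeout,
    }
```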

Deploy or Update the Endpoint

Click Deploy for a new endpoint, or Update to apply changes to an existing one.
