Speculative Decoding with a Draft Model

Enables speculative decoding by pairing the target model with a smaller, pre-trained draft model. The draft model proposes several candidate tokens at a time, which the target model then verifies in a single pass; accepting multiple tokens per pass reduces the number of decoding steps and improves throughput and latency in production inference workloads.
This feature is currently limited to a curated list of target models.
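To make the propose-then-verify loop concrete, here is a minimal, self-contained sketch of greedy draft-model speculation. The `target_next` and `draft_next` functions are hypothetical stand-ins for real models (deterministic toy rules, not an actual API): the draft proposes `k` tokens autoregressively, the target checks all of them in one pass, and the run of agreeing tokens is accepted at once.

```python
def target_next(ctx):
    # Toy "target model": deterministic next-token rule standing in for a real LLM.
    return (ctx[-1] * 2 + 1) % 101

def draft_next(ctx):
    # Toy "draft model": agrees with the target except on multiples of 7.
    t = target_next(ctx)
    return t if t % 7 else t + 1

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding: returns (new_tokens, target_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft proposes k candidate tokens autoregressively (cheap).
        draft, ctx = [], tokens[:]
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Target verifies all k positions in one pass (one expensive call).
        target_passes += 1
        accepted, ctx = [], tokens[:]
        for tok in draft:
            expected = target_next(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)  # take the target's correction, then stop
                break
        else:
            # All k drafts accepted: the same verify pass yields one bonus token.
            accepted.append(target_next(ctx))
        tokens.extend(accepted)
    return tokens[len(prompt):][:n_new], target_passes
```

Because verification is exact, the output matches what plain step-by-step decoding with the target would produce, but with fewer target passes whenever the draft agrees with the target.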

N-gram speculative decoding

You can toggle the switch to enable N-gram speculative decoding. When enabled, tokens that have already appeared in the context are used to draft likely future tokens, so the target model can verify several tokens per decoding pass without a separate draft model. For predictable tasks, such as code completion or text with repeated phrasing, this can deliver substantial performance gains. You can also set the Maximum N-gram Size, which defines how many tokens are predicted in advance; we recommend keeping the default value of 3.
Higher values can further reduce latency when the drafted tokens are accepted. However, predicting too many tokens at once lowers the acceptance rate and, in extreme cases, can even increase latency.
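The drafting step can be illustrated with a small sketch of n-gram lookup: find the longest recent suffix (up to a maximum n-gram size) that also occurred earlier in the sequence, and propose the tokens that followed that earlier occurrence. This is an illustrative simplification, not the platform's actual implementation; `ngram_propose` and its parameters are hypothetical names.

```python
def ngram_propose(tokens, max_ngram=3, k=4):
    """Propose up to k draft tokens by matching the recent suffix
    against earlier occurrences in the sequence."""
    for n in range(max_ngram, 0, -1):  # prefer the longest match
        if len(tokens) < n:
            continue
        suffix = tokens[-n:]
        # Scan earlier positions, most recent occurrence first.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                # Draft the tokens that followed the earlier occurrence.
                return tokens[i + n:i + n + k]
    return []  # no match: fall back to normal decoding for this step
```

For example, after generating `the cat sat on the mat . the cat`, the suffix `the cat` matches its earlier occurrence, so `sat on the mat` is proposed for verification. This is also why repetitive inputs benefit most: matches are found often and the drafted tokens are usually accepted.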