# Speculative Decoding

> Speed up LLM inference on Friendli Dedicated Endpoints with speculative decoding using proprietary draft models and N-gram token prediction.

Speculative decoding speeds up generation by drafting candidate tokens and verifying them with the target model in parallel, so the target model can accept more than one token per forward pass. Friendli Dedicated Endpoints support two methods: a draft-model method and an N-gram method.

## Draft-Model Method

You can enable speculative decoding by pairing the target model with a pre-trained draft model. The fast draft model proposes several tokens, and the larger target model verifies them in parallel. When the proposals are accepted, the target model emits multiple tokens per forward pass, which increases throughput.

<Note>This feature is currently limited to a curated list of target models.</Note>
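
The draft-and-verify loop can be sketched in a few lines of Python. This is a minimal illustration under greedy decoding, not Friendli's implementation: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are plain functions over integer tokens.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Return the tokens produced by one draft-and-verify step (greedy decoding)."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position. A real engine scores
    #    all k positions in one batched forward pass; this loop is sequential
    #    only to keep the sketch readable.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target's next token is emitted as well.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy draft: agrees with the target except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [1], k=4))
    print(speculative_step(toy_target, toy_draft, [4], k=4))
```

In this sketch a step always yields at least one token, because a rejected proposal is replaced by the target model's own choice. The output therefore matches what the target model would have produced on its own.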

## N-gram Method

Toggle the switch to enable N-gram speculative decoding. When enabled, the system uses previously seen tokens, rather than a separate draft model, to predict upcoming tokens, which the target model then verifies. For tasks with predictable output, this can deliver substantial performance gains.

You can also set the `Maximum N-gram Size`, which defines how many tokens are predicted in advance. We recommend keeping the default value of 3.
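
One common way to draft tokens from past tokens is to look up the most recent suffix in the earlier context and copy what followed it. The sketch below illustrates that idea only; how Friendli implements the lookup is not documented here. `ngram_propose`, `match_len`, and `num_predict` are hypothetical names, with `num_predict` playing the role this page describes for `Maximum N-gram Size`.

```python
from typing import List


def ngram_propose(tokens: List[int], match_len: int, num_predict: int) -> List[int]:
    """Propose up to num_predict tokens by reusing an earlier continuation.

    Finds the most recent earlier occurrence of the last match_len tokens and
    returns the tokens that followed it. Returns [] when there is no match.
    """
    if match_len <= 0 or len(tokens) <= match_len:
        return []
    suffix = tokens[-match_len:]
    # Scan backwards, starting just before the suffix itself.
    for start in range(len(tokens) - match_len - 1, -1, -1):
        if tokens[start:start + match_len] == suffix:
            follow = start + match_len
            return tokens[follow:follow + num_predict]
    return []


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 1, 2]
    # The suffix [1, 2] appeared at the start of history, followed by 3, 4, 5.
    print(ngram_propose(history, match_len=2, num_predict=3))
```

The proposed tokens are only candidates. The target model still verifies them, so a poor match costs verification work but does not change the generated text.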

<Note>Higher values can further reduce latency when the predicted tokens are accepted. However, predicting too many tokens at once may lower prediction efficiency and, in extreme cases, increase latency.</Note>
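
The trade-off can be illustrated with a back-of-the-envelope model. The sketch below assumes each predicted token is accepted independently with probability `p`, and that a verification step with `k` predicted tokens costs `1 + c * k` times a normal decoding step. Both assumptions, the function names, and the sample values are hypothetical and are not measurements of Friendli Dedicated Endpoints.

```python
def expected_tokens(p: float, k: int) -> float:
    """Expected tokens emitted per verification step with k predicted tokens.

    Assumes each predicted token is accepted independently with probability p.
    The step always emits one token from the target model, plus every
    predicted token that precedes the first rejection.
    """
    return sum(p ** i for i in range(k + 1))


def relative_speed(p: float, k: int, c: float) -> float:
    """Tokens per unit cost relative to plain decoding (1.0 means no change).

    Hypothetical cost model: a step with k predicted tokens costs 1 + c * k,
    where a plain decoding step costs 1.
    """
    return expected_tokens(p, k) / (1.0 + c * k)


if __name__ == "__main__":
    for p in (0.6, 0.2):
        for k in (1, 3, 8):
            print(f"p={p} k={k} relative speed={relative_speed(p, k, c=0.1):.2f}")
```

Under these assumptions the expected number of accepted tokens grows more slowly with each extra predicted token, while the verification cost keeps rising. With a low acceptance probability, a large `k` can fall below 1.0, meaning slower than not speculating at all.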
