+ Prefill is the compute-heavy pass that consumes the entire prompt and
+ builds a KV cache. Decode is the memory-bandwidth-bound loop that emits
+ tokens sequentially from that cache. The two phases have very different
+ bottlenecks, so running them on different hardware can be substantially
+ faster than doing both on one node.
+
+
+
+
+ ▶
+ Linking Instances
+
+
+
+ A linked route tells the cluster how to serve a request sent to a
+ model in that cluster: the decode node (or the least active one if
+ there are multiple) handles it first. If that node determines the
+ request needs a lot of prefill not already in the prefix cache, it
+ routes the request to the prefill node over TCP/IP. The prefill node
+ streams the KV cache back to the decode node, which picks up from there.
+
+
+ Linked instances must be running the same model family — KV layouts
+ differ across architectures. More on the blog.
+
+ ⚠ Multi-node instance detected. Remote prefill currently only
+ works on single-node (rank-0) instances. This route will not
+ function until that's supported.
+
+ {/if}
+ {#if row.families.length > 1}
+
+ ⚠ Mixed model families: {row.families.join(", ")}
+
+ {/if}
+
+ Prefill
+
+ Decode
+
+
+
+ {#each row.prefill as id (id)}
+ {@const r = instanceById[id]}
+ {#if r}
+
+ ⚠ Selected instances span multiple model families: {editingFamilies.join(", ")}. Linking across families produces a corrupt KV cache.
+
+ {/if}
+
+ {#if editingMultiNode.length > 0}
+
+ ⚠ Multi-node instance(s) selected: {editingMultiNode.join(", ")}. Remote prefill currently only works on single-node instances. This
+ route will not function until multi-node support lands.
+
+ {/if}
+
+
+ Pick a role for each instance:
+ Prefill
+ serves KV cache,
+ Decode consumes it.
+
+
+ {#each instanceRows as row (row.id)}
+ {@const role = roleOf(row.id)}
+