Preview in 5.0.x

Recall, at every token.

The native inference daemon consults associative memory once, at the prompt boundary, before the model begins generating. A later 5.0.x patch extends the same hook to fire on every generated token, so the model can pull supporting fragments while it composes the answer, not just before it starts. Longer answers stay grounded throughout; the trade-off is roughly 1–2 ms of added latency per token on CPU.

Later in 5.0.x


Why per-token

Long answers drift.

The 5.0 smart-memory-inference path consults the matrix-memory layer once, at the prompt boundary: the relevant patterns are merged into the model's state, and then the model generates the answer. For short answers that is enough, because the prompt-boundary recall covers the whole response.

For longer answers — a step-by-step procedure, a multi-section document summary, a piece of code that pivots on intermediate results — the relevant context shifts mid-answer. By token 200, what would have been most useful to recall isn't what got recalled at the start. The model can drift away from grounding it had access to but never re-consulted.

Per-token recall fires the same hook at every generated token. The cost: a small amount of overhead per token (1–2 ms on CPU, effectively hidden on GPU). The gain: longer answers stay grounded throughout.
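The difference between the two modes can be sketched as a toy decoder loop. This is only an illustration under stated assumptions: `generate`, `RecallHook`, and `StepFn` are hypothetical names, not the daemon's API, and the real hook merges patterns into model state rather than passing a list of fragments.

```python
from typing import Callable, List

# Hypothetical types for illustration only; not the daemon's real interface.
RecallHook = Callable[[List[str]], List[str]]    # context so far -> supporting fragments
StepFn = Callable[[List[str], List[str]], str]   # (context, fragments) -> next token


def generate(prompt: List[str], step: StepFn, recall: RecallHook,
             max_tokens: int, per_token: bool) -> List[str]:
    """Toy decoder loop that fires the recall hook once or on every token."""
    context = list(prompt)
    # Prompt-boundary recall: consulted once, before generation starts.
    fragments = recall(context)
    output: List[str] = []
    for _ in range(max_tokens):
        if per_token and output:
            # Per-token mode: re-consult memory using the answer so far.
            # The first token reuses the prompt-boundary result, since the
            # context has not changed yet.
            fragments = recall(context)
        token = step(context, fragments)
        output.append(token)
        context.append(token)
    return output
```

In this sketch the prompt-boundary mode calls `recall` exactly once, while per-token mode calls it once per generated token, each time with a context that includes everything produced so far.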


Trade-offs

Honest about the cost.

Per-token recall isn't free. On CPU, an extra 1–2 ms per token means a 300-token answer takes roughly 0.3–0.6 seconds longer. For most customers, that's invisible. For high-throughput workloads (a chat shell serving many concurrent users on one node), it shifts the throughput-vs-grounding curve. Customers can pick the mode per workload.
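The arithmetic behind that estimate is simple enough to check by hand. The helper below is a hypothetical illustration, not a platform utility; it just multiplies the answer length by the 1–2 ms CPU overhead quoted above.

```python
def added_latency_seconds(tokens: int, overhead_ms_per_token: float) -> float:
    """Extra wall-clock time from per-token recall, in seconds."""
    return tokens * overhead_ms_per_token / 1000.0


# A 300-token answer at the low and high ends of the quoted CPU range.
low = added_latency_seconds(300, 1.0)
high = added_latency_seconds(300, 2.0)
print(f"extra latency: {low:.1f}-{high:.1f} s")  # extra latency: 0.3-0.6 s
```

The overhead scales linearly with answer length, which is why it matters more for long-form answers and for nodes serving many concurrent requests.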

The platform ships both modes side by side, with the existing prompt-boundary mode remaining the default. Per-token recall is opt-in via a per-request flag or a per-tenant admin setting — the same surface as the 5.0 smart-memory-inference feature.
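One way the two opt-in surfaces could combine is sketched below. The names `RecallMode` and `resolve_mode` are hypothetical, and the precedence (per-request flag overrides the per-tenant setting) is an assumption made for illustration; the page does not specify how conflicts are resolved. Only the default, prompt-boundary, comes from the text above.

```python
from enum import Enum
from typing import Optional


class RecallMode(Enum):
    # Mode names mirror the options listed for the admin UI.
    OFF = "off"
    PROMPT_BOUNDARY = "prompt-boundary"
    PER_TOKEN = "per-token"


# Per the page: prompt-boundary remains the default.
DEFAULT_MODE = RecallMode.PROMPT_BOUNDARY


def resolve_mode(request_flag: Optional[RecallMode],
                 tenant_setting: Optional[RecallMode]) -> RecallMode:
    """Pick the effective recall mode for one request.

    Assumed precedence (not confirmed by the source):
    request flag, then tenant setting, then platform default.
    """
    if request_flag is not None:
        return request_flag
    if tenant_setting is not None:
        return tenant_setting
    return DEFAULT_MODE
```

Under this assumed precedence, a tenant that leaves the setting untouched keeps prompt-boundary recall, and a single long-form request can still opt in to per-token recall without changing the tenant-wide mode.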


What's pending

Honest gates on this page.

Still in flight

  • Per-token hook wired into the decoder loop (scaffolded in 5.0, activates in a later 5.0.x patch)
  • Latency budget tuning per workload class
  • Admin UI for per-tenant mode selection (prompt-boundary, per-token, or off)
  • Smoke tests covering long-form answers under per-token mode
  • Benchmark page showing latency curves for both modes on representative knowledge-base sizes

This page is updated as each piece lands. The release notes remain the authoritative record of what has shipped.


Read next.

For prompt-boundary recall today, see smart memory inference. For the rest of the memory work, see memory scoping. For the full 5.0.x roadmap, see what's next in 5.0.x.