Smart memory inference (preview)

The model consults your memory.

A preview feature in 5.0 wires the inference daemon directly to your knowledge-base memory. Before the model answers, it pulls relevant context from your installation's matrix memory, so the answer is grounded in your own data without a separate retrieval round-trip. Opt-in per request or per tenant, 1–2 ms of per-token overhead on our demo cluster, Pro+ tier.


What it does

Recall, in the decoder.

The standard pattern for grounding a model in your own data takes two steps: a retrieval call (knowledge-base search) feeds the model the relevant context, then the model answers. Smart memory inference collapses the two into one. The inference daemon already has access to the matrix-memory layer of the knowledge base your tenant points at, so at the prompt boundary it consults the relevant patterns and merges them into the model's decoding state before the first token is produced.
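To make the difference concrete, here is a toy sketch of the two call patterns as seen from the client. It is not the platform's API: `ToyBackend`, `search`, `generate`, and the `recall` flag are hypothetical names invented for illustration, and the keyword match stands in for whatever relevance test the real memory layer uses.

```python
from dataclasses import dataclass


@dataclass
class ToyBackend:
    """Stand-in for the platform. Every name here is hypothetical."""
    memory: dict
    round_trips: int = 0  # calls the client has to make

    def _lookup(self, prompt: str) -> list:
        # Toy relevance test: keep entries whose key appears in the prompt.
        return [text for key, text in self.memory.items() if key in prompt.lower()]

    def search(self, prompt: str) -> list:
        """Separate retrieval call: costs one round-trip."""
        self.round_trips += 1
        return self._lookup(prompt)

    def generate(self, prompt: str, context=None, recall: bool = False) -> str:
        """Model call: one round-trip. With recall=True the lookup happens inside."""
        self.round_trips += 1
        if recall:
            context = self._lookup(prompt)
        return f"answer using {context or []}"


def answer_two_step(backend: ToyBackend, prompt: str) -> str:
    context = backend.search(prompt)                    # round-trip 1: retrieval
    return backend.generate(prompt, context=context)    # round-trip 2: model


def answer_one_step(backend: ToyBackend, prompt: str) -> str:
    return backend.generate(prompt, recall=True)        # single round-trip


memory = {"invoice": "Invoices are due in 30 days."}
classic, smart = ToyBackend(memory), ToyBackend(memory)
question = "When is my invoice due?"
same_answer = answer_two_step(classic, question) == answer_one_step(smart, question)
print(same_answer, classic.round_trips, smart.round_trips)  # True 2 1
```

In this toy, both paths ground the answer in the same data; the only difference is how many calls the client makes.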

What changes for the user:


Performance

What the overhead looks like.

On our demo cluster (CPU-only on the controller host), the recall lookup adds 1–2 milliseconds per token at typical knowledge-base sizes. For a 300-token answer, that's roughly 0.3 to 0.6 seconds of extra decode time. For workloads on a GPU-equipped inference node, the recall happens in parallel with model compute and the overhead is effectively hidden.
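The arithmetic behind the 300-token example can be checked in a few lines. This is a back-of-the-envelope sketch that assumes the per-token overhead is constant across the whole answer; the function name is ours, not part of the product.

```python
def added_decode_seconds(tokens: int, overhead_ms_per_token: float) -> float:
    """Extra decode time in seconds for a fixed per-token recall overhead."""
    return tokens * overhead_ms_per_token / 1000.0


low = added_decode_seconds(300, 1.0)   # 1 ms per token
high = added_decode_seconds(300, 2.0)  # 2 ms per token
print(f"{low:.1f} to {high:.1f} s")    # 0.3 to 0.6 s
```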

The lookup scales sublinearly with knowledge-base size: the matrix-memory layer performs one matrix-vector product against the stored pattern set rather than a search across vectors. Larger knowledge bases cost more per lookup, but the cost grows gently.
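As a rough illustration of what "one matrix-vector product against the stored pattern set" can look like, the sketch below scores a query against every stored pattern in a single product and then blends the patterns by those scores. The layout (one pattern per row), the softmax blend, and the `beta` sharpness parameter are our assumptions in the style of an associative memory; the page does not describe the actual layer, and this toy says nothing about how the real lookup scales.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product: one dot product per stored pattern."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def recall(patterns, query, beta=4.0):
    """Blend stored patterns, weighted by their similarity to the query."""
    scores = matvec(patterns, query)  # the single product against the pattern set
    peak = max(scores)                # subtract the peak for numerical stability
    weights = [math.exp(beta * (s - peak)) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(query)
    return [sum(w * p[i] for w, p in zip(weights, patterns)) for i in range(dim)]


patterns = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.9, 0.1, 0.0]  # closest to the first stored pattern
recalled = recall(patterns, query)
best = max(range(len(recalled)), key=recalled.__getitem__)
print(best)  # 0
```

There is no index to traverse and no candidate list to rerank in this form, which is the contrast with vector search that the paragraph draws.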


Two modes

Prompt-boundary recall today; per-token recall in preview.

Two operating modes ship in 5.0:


How to enable

Opt-in per request, or per tenant.

Two paths:

Default is off everywhere. Enabling the feature does not change the model, the prompt, or the response shape — only the decoder's view of your data.
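One way to picture how the two opt-in paths and the off-by-default rule could combine is the small resolver below. It is a sketch under an assumption the page does not state: that a per-request setting, when present, takes precedence over the tenant setting. The function and parameter names are hypothetical and do not correspond to any documented configuration key.

```python
from typing import Optional


def recall_enabled(request_flag: Optional[bool], tenant_flag: Optional[bool]) -> bool:
    """Resolve the effective setting; None means 'not specified at this level'."""
    if request_flag is not None:   # assumed: the request wins when it says anything
        return request_flag
    if tenant_flag is not None:    # otherwise fall back to the tenant setting
        return tenant_flag
    return False                   # default is off everywhere


print(recall_enabled(None, None))   # False
print(recall_enabled(True, None))   # True
print(recall_enabled(None, True))   # True
print(recall_enabled(False, True))  # False
```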


Honest scope

What this is not.


Next.

For the other memory-layer preview features (compressed memory, distilled router), read advanced retrieval. For the platform's overall data posture, read your data. To install: get started. Questions: office@eldric.ai.