Smart memory inference (preview)

The model
consults your memory.

A preview feature in 5.0 wires Eldric's native inference to your knowledge-base memory directly. Before the model answers, it pulls relevant context from your installation's matrix memory; the answer is grounded in your own data without a separate retrieval round-trip. Opt-in per request, sub-2 ms per-token overhead, Pro+ tier.

What it does

Recall, in the decoder.

The standard pattern for grounding a model in your own data is two steps: a retrieval call (knowledge-base search) feeds the model the relevant context, then the model answers. Smart memory inference collapses that into one: Eldric's native inference already has access to the matrix-memory layer of the knowledge base your tenant points at; at the prompt boundary, the relevant patterns are consulted and merged into the model's decoding state before it produces the first token.

What changes for the user:

The model has your context already. It cites your documents, uses your terminology, references prior conversations within the same tenant — without an explicit retrieval call wrapping the chat.
One round-trip, not two. Useful when total latency matters: the agent worker doesn't need a separate /search step for grounding when the answer is one matrix-memory query away.
Quality on customer-specific content. For workloads where the answer pivots on the institution's own documents — clinical guidelines, contract templates, machining records — the model's first-pass answer reads like an answer that read your data, not a generic answer asking you to clarify.

Performance

What the overhead looks like.

On our demo cluster (CPU-only on the controller host), the recall lookup adds 1–2 milliseconds per token at typical knowledge-base sizes. For a 300-token answer, that's an extra fraction of a second of decode time. For workloads on a GPU-equipped inference node, the recall happens in parallel with model compute and the overhead is effectively hidden.

The lookup scales with knowledge-base size sublinearly — the matrix-memory layer is one direct lookup against the stored pattern set, not a search across vectors. Larger knowledge bases cost more per lookup, but the relationship is gentle.

Two modes

Prompt-boundary recall today; per-token recall in preview.

Two operating modes ship in 5.0:

Prompt-boundary recall. Eldric's native inference consults the matrix memory once at the start of a response, merging the recalled context into the model's state before decoding begins. Lowest overhead; appropriate for most chat-style workloads. Ships as GA on Pro+.
Per-token recall (preview). The daemon re-consults the memory at every decoded token. Useful for very long answers where the relevant context shifts mid-answer (e.g., a multi-section document summary, a step-by-step procedure that pivots on intermediate results). Higher overhead; ships as preview through the 5.0 line, GA target a later 5.0.x patch.

How to enable

Opt-in per request, or per tenant.

Two paths:

Admin opt-in per tenant. Admin Console → Tenants → pick a tenant → Smart Memory Inference → choose namespace + mode. Once enabled, requests from that tenant automatically use the feature.
Per-request flag. Pass smart_memory: true on the chat-completion request body to opt that specific call in. Useful when you want the feature only for certain workloads (long-form generation, customer-specific reports) and not for short generic queries.

Default is off everywhere. Enabling the feature does not change the model, the prompt, or the response shape — only the decoder's view of your data.

Honest scope

What this is not.

Not a replacement for explicit retrieval. When you want sources cited in the answer with verifiable links back to documents, run the standard knowledge-base search pattern. Smart memory inference is a recall mechanism, not a citation mechanism.
Not on by default. Standard workloads stay on the standard inference path. The feature ships preview status through the 5.0 line; we want customer feedback before it becomes the new default.
Not a cross-tenant feature. The memory the decoder reaches is the one scoped to the active tenant. The decoder never sees another tenant's data, by construction.
Local only. The recall step runs on your hardware, in your installation. There's no cloud component, no external lookup, no data leaving your network.

Next.

For the other memory-layer preview features (compressed memory, distilled router), read advanced retrieval. For the platform's overall data posture, read your data. To install: get started. Questions: office@eldric.ai.

The modelconsults your memory.