Data access · architecture
Connect to everything — NFS, SQL, z/OS, APIs
Enterprise data doesn’t live in one place. It’s on NFS mounts somebody set up in 2009, in PostgreSQL and MySQL behind a dozen apps, in a z/OS Db2 warehouse that prints the monthly close, in SharePoint, in object storage, in vendor REST APIs. Most AI demos handle this by assuming you’ll ingest all of it into a vector database. You won’t. You’d lose freshness, ACLs, referential integrity, and the patience of every DBA in the building.
Eldric AI OS takes the other route: connect where the data lives. NFS stays NFS. SQL stays SQL. z/OS stays z/OS. Every source is a plugin under a single retrieval contract; one chat query fans out across all of them in parallel; citations travel back through the stack so users and auditors can always ask “where did that come from?”.
This post is about the contract — what it looks like, what ships in alpha.3 (more than you’d expect, thanks to unixODBC), and the honest roadmap for everything else.
The problem with “ingest everything”
The fastest way to demo a RAG system is to crawl a share, embed every document, drop them in a vector DB, and ask questions. It’s also the fastest way to ship something that doesn’t survive first contact with a real enterprise:
- Freshness dies. The database of record changes every minute; the embedding re-index runs nightly. Answers drift out of sync with reality.
- ACLs die. Once a document lands in a vector store, the original access-control model is gone. HR data, unredacted contracts, closed-deal Salesforce records — all uniformly retrievable by anyone with chat access.
- Operational DBs die. You’re not going to re-ingest a 300 GB z/OS Db2 warehouse into a FAISS index. Nor should you — Db2 is already indexed, already backed up, already audited.
- The vendor API changes tomorrow. Cached copies of a SharePoint tree go stale the day someone renames a folder.
The principle Eldric settles on: the database of record is the database of record. Query it live; answer with citations.
The contract — `retrieval.<plugin-id>`
Every data source in Eldric answers the same tiny contract. The Edge module fans a chat query out to every enabled `data.*` plugin and tries two backends in order:

- An in-process syscall named `retrieval.<plugin-id>`, if the owning module registered one. Sub-microsecond dispatch, no HTTP.
- An extension bridge — the plugin's config carries `{extension, tool}`, and Edge invokes the loaded extension over the kernel's `extension.invoke_tool` syscall, which POSTs to the extension's `/invoke`.
Either path returns the same JSON shape:
```json
{
  "snippets": [
    {"text": "...", "source": "nfs://plant-docs/sop/1042.pdf#p7", "score": 0.81},
    {"text": "...", "source": "postgres://mes.prod/lots/DQ_103", "score": 0.73}
  ]
}
```
That’s the whole API. Any source — a filesystem, a SQL database, an ODBC DSN to a mainframe, a REST API, a vector store, a hand-written Python script talking to FTP — that can produce this shape is a first-class peer in the retrieval fan-out.
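To make the contract concrete, here is a minimal sketch of a source that fulfils it — a naive substring search over a directory of text files. Everything here (the function name, the root path, the scoring rule) is hypothetical illustration, not Eldric's actual implementation; the only load-bearing part is the returned shape.

```python
import os

def retrieve(query: str, top_k: int = 5, root: str = "/mnt/plant-docs") -> dict:
    """Toy retrieval backend: naive substring match over text files,
    returning the {snippets: [{text, source, score}]} contract."""
    snippets = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue
            hits = text.lower().count(query.lower())
            if hits:
                # Crude relevance: hit count squashed into (0, 1).
                score = hits / (hits + 1)
                snippets.append({
                    "text": text[:400],
                    "source": f"nfs://{os.path.relpath(path, root)}",
                    "score": round(score, 2),
                })
    snippets.sort(key=lambda s: s["score"], reverse=True)
    return {"snippets": snippets[:top_k]}
```

A real source would replace the body with an SQL query, an API call, or a vector search — the fan-out neither knows nor cares, as long as this shape comes back.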
What ships in alpha.3 today
The `eldric-aios` RPM on Fedora 42+ declares `BuildRequires: unixODBC-devel` and `Requires: unixODBC`. Not decoration — the data module links the ODBC client at build time, and the shipped `retrieval.data.local` syscall routes to any configured DSN via `SQLDriverConnect` / `SQLExecDirect` / `SQLFetch`. The driver set is the admin's choice: install whichever `*-odbc` package you need and add a stanza to `/etc/odbcinst.ini`.
That one path covers most of what an enterprise actually needs:
| Backend | How | Status |
|---|---|---|
| Filesystem / NFS | POSIX access inside the data module; nfs-ganesha integration for serving exports or mounting remote shares. Cited paths are the same paths ops can `cat` from a shell. | shipped |
| SQLite | Always linked; default backend for small ops metadata. | shipped |
| PostgreSQL, MySQL, MariaDB | Via the ODBC layer — admin installs `postgresql-odbc` or `mysql-connector-odbc`, adds a DSN, Eldric queries live. | shipped |
| Oracle, MSSQL | Same path — `oracle-instantclient-odbc` or Microsoft's `msodbcsql18`. | shipped |
| IBM Db2 LUW | IBM's free DSDriver registers a unixODBC driver; Eldric talks to Db2 LUW through it. | shipped |
| IBM Db2 z/OS (mainframe) | The same DSDriver speaks DRDA to the LPAR over port 446 / 50000. With a DB2 Connect license on the Linux host, the AI sees a z/OS warehouse as just another ODBC DSN. | shipped |
| Vector store + Matrix Memory | Merged with the above into one `retrieval.data.local` answer: exact retrieval via the vector side, pattern recall via the matrix-memory side. | shipped |
| `data.arxiv`, `data.nasa_apod` | Reference extensions at `sdk/extension/examples/` — each is ~80 lines of Python fulfilling the `retrieval.<id>` contract via the bridge. Templates for the rest of the 4.x science surface. | shipped |
| `data.pageindex` | Vectorless / reasoning-based retrieval — hierarchical TOC + LLM navigation. Useful on structured professional docs (SEC filings, FDA submissions, legal, textbooks) where vector similarity loses to expert tree-walking. | sketch |
That means “can Eldric talk to our DB2 z/OS warehouse?” is already a yes on alpha.3, as long as an admin installs IBM’s ODBC driver. No Phase-2 wait.
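The essence of that query path — run SQL against a database of record, map rows into the retrieval contract, and carry a citation per snippet — can be sketched in a few lines. To keep the sketch runnable anywhere this uses the stdlib `sqlite3` module (SQLite is the always-linked default backend above); the same row-to-snippet mapping applies unchanged to any unixODBC DSN, where the shipped C side calls `SQLDriverConnect` / `SQLExecDirect` / `SQLFetch` instead. Function and URI names are illustrative, not Eldric's.

```python
import sqlite3

def rows_to_snippets(db: str, sql: str, params: tuple = (), top_k: int = 10) -> dict:
    """Map SQL result rows into the {snippets: [{text, source, score}]}
    contract. `db` stands in for an ODBC DSN name in the real path."""
    conn = sqlite3.connect(db)
    try:
        cur = conn.execute(sql, params)
        cols = [c[0] for c in cur.description]
        snippets = [{
            # Flatten each row into a readable "col=value; ..." snippet.
            "text": "; ".join(f"{k}={v}" for k, v in zip(cols, row)),
            "source": f"sqlite://{db}",  # citation travels with the snippet
            "score": 1.0,                # exact SQL match: no ranking needed
        } for row in cur.fetchmany(top_k)]
        return {"snippets": snippets}
    finally:
        conn.close()
```

Against a Db2 z/OS DSN the only conceptual change is the connection handle; the contract side is identical.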
What’s on the roadmap
Plenty of enterprise data doesn’t speak ODBC. Those are the actual gaps.
Phase 1 — streaming, NoSQL, object storage
Headers for these connectors exist in `cpp/include/distributed/data/connectors/`; each becomes a real `retrieval.data.<name>` syscall when its driver is linked and the query path lands.
| Backend | Driver | Notes |
|---|---|---|
| MongoDB | mongocxx (Apache-2.0) | Document store, aggregation pipeline. |
| Kafka | librdkafka (BSD-2) | Streaming ingest, topic consumption. |
| Elasticsearch / OpenSearch | libcurl (REST) | Search engine + vector-store fallback. |
| ClickHouse | clickhouse-cpp (Apache-2.0) | Column-oriented OLAP, analytical queries. |
| MinIO / S3 | aws-sdk-cpp (S3 module) | Object storage, data-lake entry. |
Phase 2 — native mainframe (Enterprise Tier)
The ODBC path already covers query-side access to Db2 LUW + Db2 z/OS. Phase-2 is about the rest of the mainframe surface — messaging, legacy record stores, transaction gateways — plus a native DB2 CLI path for customers who want DRDA without going through unixODBC.
| Backend | Driver | Protocol / use |
|---|---|---|
| IBM MQ | IBM MQ C client `libmqm` (dlopen) | MQI protocol, port 1414 — enterprise messaging backbone. |
| VSAM | REST via z/OS Connect EE | HTTP/JSON — customer deploys the gateway side. |
| IMS | IBM Universal DB driver | IMS Connect over TCP; DL/I navigation or SQL abstraction. |
| CICS | CICS Liberty (REST) | HTTP/JSON — transaction invocation, not a data query. |
| Native DB2 CLI (DRDA) | IBM DB2 CLI via runtime dlopen | Direct DRDA path for customers who don’t want the unixODBC indirection. |
All IBM proprietary drivers are loaded via `dlopen`/`dlsym`, so `eldric-aios` compiles and runs without them; the mainframe paths light up when the customer installs the IBM client on the Linux host. Enterprise-tier licensing gates the Phase-2 set.
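The graceful-degradation pattern behind that is plain dlopen with a fallback. Here is a minimal Python sketch using `ctypes` (the `libmqm_r` / `libmqm` names are IBM MQ's real client library names, but this loader is an illustration of the pattern, not Eldric's actual C++ loader):

```python
import ctypes

def load_optional_driver(candidates):
    """dlopen-style optional loading: try each shared-library name in turn
    and return a handle, or None when the vendor client isn't installed.
    A None handle just means the connector stays dark — nothing fails."""
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

# IBM MQ client, present only if the customer installed it on the host.
libmqm = load_optional_driver(["libmqm_r.so", "libmqm.so"])
MQ_AVAILABLE = libmqm is not None
```

Per-symbol lookup (`dlsym`) then happens lazily on the returned handle, so a partially installed client degrades just as gracefully.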
Phase 3 (Cassandra, HBase, Hive, Snowflake, Redshift, BigQuery, Azure Synapse, Databricks Delta Lake, Spark Connect, Druid) follows the same pattern — each an implementation against the common `DataSource` interface. Snowflake and Synapse are already reachable through their ODBC drivers today; Phase-3 is about native protocols where they outperform ODBC.
Everything else — the extension path works today
What about sources that don’t speak ODBC and aren’t on the Phase-1 list? SharePoint, Salesforce, ServiceNow, an FTP drop, a vendor REST API, a line-of-business SOAP service — the normal enterprise long tail.
The retrieval contract is universal, so the same manifest + ~80-line Python template the `arxiv` reference extension uses handles any of them:
```sh
cat > ${ELDRIC_DATA_DIR}/extensions/sharepoint_corp.extension.yaml <<YAML
extension:
  name: sharepoint_corp
  display_name: Corporate SharePoint
  category: data        # ← makes it a data plugin, auto-surfaced in chat
  model: B              # external Python process
  external_url: http://127.0.0.1:9600
  tools: [search]
YAML

curl -XPOST http://localhost:8880/api/v1/extensions/load \
  -d '{"name":"sharepoint_corp"}'
# data.sharepoint_corp is now a toggle in the chat sidebar. No Edge rebuild.
```
The Python side is a thin shim that takes `{query, top_k, config}`, calls the vendor's SharePoint Search API, and returns the `{snippets: [{text, source, score}]}` contract. When a future phase bundles SharePoint as a native connector, it ships its own syscall — but users see no change: the same toggle in the sidebar, the same source in the fan-out, the same citation format.
That’s the payoff of making the retrieval contract the surface: “built-in” and “customer-written” look identical to the rest of the system. The C++ roadmap replaces glue code with optimised drivers without moving the contract.
Fan-out — one query, every enabled source
Turning individual connectors into useful AI is the other half
of the story. Flipping three toggles in the sidebar
— data.local (which today may include a
SharePoint DSN or a Db2 z/OS DSN via ODBC),
data.arxiv, and a customer-written
data.sharepoint_corp extension — tells the
Edge module to fan a single chat query out to all three
in parallel, merge the returned snippets, and
synthesise a system message prefixed with retrieval context
before the LLM sees anything.
Every response carries an `X-Eldric-Fanout` header listing which plugins answered, with what count, and whether the backend was a native syscall, a loaded extension, or no-backend (toggle enabled but nothing wired). Admins see wiring gaps immediately; users see "retrieved from: Local KB (3), arXiv (5), SharePoint (2)" under the assistant message and can click through for the source list.
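The parallel fan-out plus merge plus per-plugin counts can be sketched like this. This is a simplified model of the behaviour described above, not Eldric's Edge code: `backends` maps plugin-id to any callable fulfilling the contract, and the returned counts are what the fan-out summary would report.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(query: str, backends: dict, top_k: int = 8):
    """Query every enabled source in parallel, merge snippets by score,
    and report per-plugin counts (the fan-out summary).
    `backends`: plugin-id -> callable(query) -> {"snippets": [...]}."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {pid: pool.submit(fn, query) for pid, fn in backends.items()}
    merged, counts = [], {}
    for pid, fut in futures.items():
        try:
            snippets = fut.result().get("snippets", [])
        except Exception:
            snippets = []  # a failed source counts zero; it never blocks the answer
        counts[pid] = len(snippets)
        for s in snippets:
            merged.append({**s, "plugin": pid})  # keep provenance for citations
    merged.sort(key=lambda s: s["score"], reverse=True)
    return merged[:top_k], counts
```

One design point the sketch preserves: a slow or broken source degrades to an empty answer rather than failing the whole query, which is what lets admins spot wiring gaps from the counts instead of from user-visible errors.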
What this means for a private-cloud deployment
Three decisions flow from this pattern:
- Eldric lives on-prem. No data leaves the environment; no source system gets a replica in a third-party store. The LLM itself runs on the same host via embedded llama.cpp.
- Existing ACLs keep working. Eldric authenticates as a service principal to each source. Source-side security policies apply verbatim. Per-tenant isolation in Eldric sits on top of — not instead of — source-side rules.
- Adding a source is a manifest drop. A new vendor, a new database, a new mainframe region — same ~80-line Python template, same YAML manifest, same toggle appearing in every user’s sidebar. No fork, no rebuild, no user re-training.
If your organisation’s data is spread across NFS and SQL and Db2 and SharePoint and half a dozen vendor APIs — the normal state of a grown-up enterprise — this is the shape of a private-cloud AI assistant that’s actually deployable: small kernel, one retrieval contract, lots of plugins, no big-bang ingest.