A scientific paper isn't the same shape as a sensor stream. A CSV isn't the same shape as a Python source file. A 90-minute video isn't the same shape as a one-page memo. Eldric ships content-aware chunking: the platform detects what you uploaded, picks an appropriate strategy by default, and lets you adjust before committing. Better chunks at ingest mean better RAG hits at query time.
When you drag a file into the chat shell or a knowledge-base management page, the upload flow doesn't immediately commit. The platform inspects the file first — content type, language, length, structure — then opens a suggestion dialog with parameters pre-filled. You see what's about to happen, you can override anything, and only when you click Commit to RAG does the ingestion actually fire.
What the suggestion dialog shows you, per file:
Click Preview chunks to see the first 5–10 chunks the strategy would produce. Adjust the strategy and re-preview to compare. Click Commit to RAG when satisfied.
The defaults are listed below. Every value can be overridden in the upload dialog and persisted per knowledge base.
| Content type | Strategy | Chunk size | Overlap | Auto-enrichment |
|---|---|---|---|---|
| Scientific PDF | semantic (per-section) | 512 tokens | 50 | authors, DOI, refs, entities |
| Markdown / docs | semantic (heading-boundary) | 384 tokens | 40 | headings, code blocks, cross-links |
| Code (Python, C++, JS, …) | function-boundary | 1024 tokens | 100 | symbols, imports, docstrings |
| CSV / TSV | per-row or per-cluster | row-natural | 0 | column stats, value distributions |
| Audio | per-utterance after STT | n/a | n/a | transcript, speaker diarization, timestamps |
| Video | per-scene after scene detection | n/a | n/a | scene detection, frame samples, transcript |
| Image | per-image | n/a | n/a | vision embedding, description, OCR text |
| Sensor time-series | per-window | 5 minutes | 30 seconds | anomaly tags, trend direction, range |
| Genomic FASTA | per-sequence | n/a | n/a | gene annotation, GC content, ORF |
| Chemical SMILES | per-molecule | n/a | n/a | properties, ADMET, similar compounds |
| Plain text | fixed | 512 tokens | 50 | language, keyword extraction |
| Binary / unknown | metadata-only | n/a | n/a | filename, size, magic bytes, LLM description |
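One way to picture the table is as a lookup with per-upload overrides. The sketch below is illustrative only: the `ChunkDefaults` class, the content-type keys, and the `suggest` function are hypothetical names, not Eldric's API, and only a subset of the rows is included.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ChunkDefaults:
    strategy: str
    chunk_size: Optional[int]  # tokens; None where the table says n/a
    overlap: Optional[int]     # tokens; None where the table says n/a


# Subset of the defaults table, keyed by hypothetical content-type labels.
DEFAULTS = {
    "scientific_pdf": ChunkDefaults("semantic", 512, 50),
    "markdown": ChunkDefaults("semantic", 384, 40),
    "code": ChunkDefaults("function-boundary", 1024, 100),
    "plain_text": ChunkDefaults("fixed", 512, 50),
    "unknown": ChunkDefaults("metadata-only", None, None),
}


def suggest(content_type: str, **overrides) -> ChunkDefaults:
    """Return the default for a content type, with any user overrides applied."""
    base = DEFAULTS.get(content_type, DEFAULTS["unknown"])
    return replace(base, **overrides)
```

Calling `suggest("code", overlap=64)` keeps the function-boundary strategy and the 1024-token size while changing only the overlap, which mirrors the "pre-filled, but overridable" behaviour of the dialog.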
Semantic chunking splits on natural boundaries (paragraphs, sections, headings), then merges short adjacent pieces until each chunk is close to the target token count. It works best for documents where meaning sits inside section boundaries: scientific papers, contracts, policy manuals.
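The split-then-merge idea can be sketched in a few lines. This is a minimal illustration, not Eldric's implementation: it assumes blank lines mark the boundaries, uses a whitespace word count as a stand-in for a real tokenizer, and `semantic_chunks` is a hypothetical name.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def semantic_chunks(text: str, target_tokens: int = 512) -> list[str]:
    # Split on blank lines, treating each paragraph as a natural boundary.
    pieces = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        n = count_tokens(piece)
        # Close the current chunk if adding this piece would exceed the target.
        if current and current_len + n > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than the target is kept whole in this sketch; a production splitter would need a fallback for that case.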
Fixed chunking cuts at the target token count regardless of structure, with overlap so that meaning spanning a boundary is not lost. It works best when the input has no useful structure: long plain-text logs, transcripts without speaker turns, OCR'd images where layout is gone.
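A fixed-size splitter with overlap is short enough to show in full. The sketch below assumes the input has already been tokenized into a list; `fixed_chunks` is a hypothetical name, and the defaults are the plain-text values from the table.

```python
def fixed_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each chunk starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a chunk has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

With `size=4` and `overlap=1`, ten tokens produce three chunks, and the last token of each chunk reappears as the first token of the next.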
Function-boundary chunking is for source code. It splits at function, class, and method boundaries, and carries the function signature into each chunk so retrieval can match "where is the validate_input function" to the actual implementation.
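For Python input, the standard-library `ast` module is enough to illustrate the idea. This is a sketch under assumptions, not the platform's parser: `function_chunks` is a hypothetical name, the "signature" is simply the first line of the definition, and a class chunk here also contains the text of its methods.

```python
import ast


def function_chunks(source: str) -> list[dict]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                # First line of the definition; multi-line signatures are truncated.
                "signature": lines[node.lineno - 1].strip(),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    # ast.walk gives no ordering guarantee, so sort by position in the file.
    return sorted(chunks, key=lambda c: c["start_line"])
```

Storing `name` and `signature` alongside the body is what lets a query about a function name land on the chunk that implements it.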
Per-row and per-cluster chunking are for tabular data. Per-row treats each row as a chunk; per-cluster groups rows by similarity (handy for sensor data where 1000 rows might be one "operating regime"). Column statistics ride along as metadata so queries against columns work.
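The per-row variant can be sketched with the standard `csv` module. The function name, the `key=value` text rendering, and the choice of min, max, and mean as the column statistics are all assumptions made for illustration; per-cluster grouping is not shown.

```python
import csv
import io


def row_chunks(csv_text: str) -> list[dict]:
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    column_stats = {}
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # non-numeric column: no statistics recorded
        column_stats[column] = {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
    # One chunk per row, with the shared column statistics attached as metadata.
    return [
        {
            "text": ", ".join(f"{k}={v}" for k, v in row.items()),
            "metadata": {"row": i, "column_stats": column_stats},
        }
        for i, row in enumerate(rows)
    ]
```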
Per-utterance and per-scene chunking are for audio and video. The media worker transcribes and segments first, then each utterance (for audio) or scene (for video) becomes a chunk with the transcript and timestamps attached. This lets you query "who said X around the 12-minute mark" and get an answer that points at the right scene.
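To make the timestamp metadata concrete, here is a small sketch of what happens after transcription. It assumes the speech-to-text step has already produced utterances with speaker labels and times; the `Utterance` type and both functions are hypothetical, and real retrieval would combine the time filter with a vector search over the text.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start_s: float
    end_s: float
    text: str


def utterance_chunks(utterances: list[Utterance]) -> list[dict]:
    # One chunk per utterance; speaker and timestamps travel as metadata.
    return [
        {"text": u.text,
         "metadata": {"speaker": u.speaker, "start_s": u.start_s, "end_s": u.end_s}}
        for u in utterances
    ]


def near_minute(chunks: list[dict], minute: float, tolerance_s: float = 60.0) -> list[dict]:
    # Keep chunks whose time span lies within tolerance_s of the requested minute.
    t = minute * 60
    return [
        c for c in chunks
        if c["metadata"]["start_s"] - tolerance_s <= t <= c["metadata"]["end_s"] + tolerance_s
    ]
```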
Per-window chunking is for sensor time-series. It slides a window across the stream and summarises each window into a chunk with anomaly tags, trend direction, and value range. Good for IoT and SCADA data where the structure is "5 minutes of one shift", "5 minutes of another shift", and you want to query against operating modes.
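A sliding-window summariser might look like the sketch below, using the 5-minute window and 30-second overlap from the defaults table. The summary rules are assumptions chosen for illustration: trend is the sign of last minus first value, and a window is flagged anomalous if any value lies more than `z_threshold` population standard deviations from the window mean. `window_chunks` is a hypothetical name.

```python
from statistics import mean, pstdev


def window_chunks(samples, window_s=300.0, overlap_s=30.0, z_threshold=3.0):
    """samples: list of (timestamp_seconds, value) pairs, sorted by time."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be non-negative and smaller than window_s")
    if not samples:
        return []
    step = window_s - overlap_s
    start, last = samples[0][0], samples[-1][0]
    chunks = []
    while start <= last:
        values = [v for t, v in samples if start <= t < start + window_s]
        if values:
            mu, sigma = mean(values), pstdev(values)
            delta = values[-1] - values[0]
            chunks.append({
                "start_s": start,
                "end_s": start + window_s,
                "range": (min(values), max(values)),
                "trend": "rising" if delta > 0 else "falling" if delta < 0 else "flat",
                # Assumed rule: flag the window if any value is a z-score outlier.
                "anomaly": any(sigma > 0 and abs(v - mu) > z_threshold * sigma
                               for v in values),
            })
        start += step
    return chunks
```

Each summary is small enough to embed as text, so a query about an operating mode matches a window rather than thousands of raw readings.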
Metadata-only is the fallback for content types the platform can't extract text from: binaries, encrypted archives, raw firmware images. It stores filename, size, magic-byte signature, and an LLM-generated description of the file's role, so the file is searchable even when its contents aren't.
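The non-LLM part of that record can be gathered with the standard library alone. The sketch below is an assumption about shape, not Eldric's schema: `metadata_chunk` and its field names are hypothetical, the signature table covers only four well-known formats, and the LLM-generated description is left out.

```python
import os

# A few well-known file signatures (leading magic bytes).
MAGIC = {
    b"\x7fELF": "ELF executable",
    b"PK\x03\x04": "ZIP archive",
    b"%PDF": "PDF document",
    b"\x1f\x8b": "gzip stream",
}


def metadata_chunk(path: str) -> dict:
    with open(path, "rb") as f:
        head = f.read(8)  # only the first bytes are needed for the signature
    detected = next(
        (label for signature, label in MAGIC.items() if head.startswith(signature)),
        "unknown",
    )
    return {
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "magic_bytes": head.hex(),
        "detected_type": detected,
    }
```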
Suggestion-and-confirm is the per-file flow. For a knowledge base where every document is the same shape — for instance, a KB of clinical-guideline PDFs — set the strategy once at the KB level:
    curl -X POST -H "X-API-Key: $ELDRIC_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"chunk_size":512,"chunk_overlap":50,"strategy":"semantic"}' \
      "https://<your-host>/api/v1/vector/namespaces/<tenant>/<ns>/config"
From this point on, uploads into that KB skip the suggestion dialog (or show it with the KB defaults pre-filled). Re-embedding the existing documents after a strategy change is a one-button operation in the admin console.
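For scripted setups, the same KB-level call can be built from Python with the standard library. The endpoint path, header name, and JSON fields are taken from the curl command above; the helper name `build_config_request` and its parameters are hypothetical, and the response format is not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request


def build_config_request(host, tenant, namespace, api_key, *,
                         chunk_size=512, chunk_overlap=50, strategy="semantic"):
    # Same JSON body as the curl example.
    body = json.dumps({
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "strategy": strategy,
    }).encode("utf-8")
    url = f"https://{host}/api/v1/vector/namespaces/{tenant}/{namespace}/config"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; reading the API key from the `ELDRIC_API_KEY` environment variable keeps it out of the script.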
A generic 512-token chunk with a 50-token overlap works across everything, but it works worst for content where the meaningful unit is something else: a sentence in a contract, a function in code, an utterance in an interview, a row in a CSV. Content-aware chunking is the single change that takes RAG from "sometimes finds the right thing" to "reliably finds the right thing", because the units stored in the vector index match the units a query is actually about.
Combined with the EMM compressed-retrieval preview and the smart memory inference preview, the chunking layer is the foundation of Eldric's retention loop: high-quality chunks at ingest → better RAG hits → better user acceptance signal → more useful training corpus → smarter platform over time.
For the customer-facing how-to: using RAG. For the architecture view: RAG architecture. For the cascade behaviour (ENRN → EMM → RAG → live source): RAG on demand. For custom classification — teaching Eldric your own intent classes — see custom classification.