Chunking strategies

Different content, different chunks.

A scientific paper isn't the same shape as a sensor stream. A CSV isn't the same shape as a Python source file. A 90-minute video isn't the same shape as a one-page memo. Eldric ships content-aware chunking: the platform detects what you uploaded, picks an appropriate strategy by default, and lets you adjust before committing. Better chunks at ingest mean better RAG hits at query time.

Intelligent upload

Eldric suggests; you confirm.

When you drag a file into the chat shell or a knowledge-base management page, the upload flow doesn't immediately commit. The platform inspects the file first — content type, language, length, structure — then opens a suggestion dialog with parameters pre-filled. You see what's about to happen, you can override anything, and only when you click Commit to RAG does the ingestion actually fire.

What the suggestion dialog shows you, per file:

Detected. Content type ("scientific paper · PDF · 18 pages · EN"), topic tags from a quick read of the first 2 KB, authors / DOI if the file carries them, estimated chunk count and index size.
Chunking. Suggested strategy, chunk size and overlap, per the defaults table below. Pre-selected; you can change any of them.
Enrichment. Checkboxes for the metadata the platform will auto-extract (authors, DOI, entities, cross-references, topic tags, Q&A pairs for training). Defaults follow content type.
Keywords. Auto-extracted from the document; editable, removable.
Target. Which knowledge base to land in. "Create new" is in the picker.
Sharing. ACL — private / project / workgroup / public — per knowledge base.

Click Preview chunks to see the first 5–10 chunks the strategy would produce. Adjust the strategy and re-preview to compare. Click Commit to RAG when satisfied.

The defaults

Suggested strategy per content type.

Content type	Strategy	Chunk size	Overlap	Auto-enrichment
Scientific PDF	semantic (per-section)	512 tokens	50 tokens	authors, DOI, refs, entities
Markdown / docs	semantic (heading-boundary)	384 tokens	40 tokens	headings, code blocks, cross-links
Code (Python, C++, JS, …)	function-boundary	1024 tokens	100 tokens	symbols, imports, docstrings
CSV / TSV	per-row or per-cluster	row-natural	0	column stats, value distributions
Audio	per-utterance after STT	n/a	n/a	transcript, speaker diarization, timestamps
Video	per-scene after scene detection	n/a	n/a	scene detection, frame samples, transcript
Image	per-image	n/a	n/a	vision embedding, description, OCR text
Sensor time-series	per-window	5 minutes	30 seconds	anomaly tags, trend direction, range
Genomic FASTA	per-sequence	n/a	n/a	gene annotation, GC content, ORF
Chemical SMILES	per-molecule	n/a	n/a	properties, ADMET, similar compounds
Plain text	fixed	512 tokens	50 tokens	language, keyword extraction
Binary / unknown	metadata-only	n/a	n/a	filename, size, magic bytes, LLM description
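
The defaults table can be read as a lookup from detected content type to a starting strategy. A minimal sketch in Python, covering a few rows only. The names `ChunkDefaults`, `DEFAULTS` and `suggest`, and the dictionary keys, are hypothetical and chosen for illustration; this is not Eldric's detection code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkDefaults:
    strategy: str
    chunk_size: Optional[int]  # tokens; None where the table says n/a
    overlap: Optional[int]     # tokens; None where the table says n/a


# A few rows of the defaults table; the keys are hypothetical labels.
DEFAULTS = {
    "scientific_pdf": ChunkDefaults("semantic", 512, 50),
    "markdown": ChunkDefaults("semantic", 384, 40),
    "code": ChunkDefaults("function-boundary", 1024, 100),
    "plain_text": ChunkDefaults("fixed", 512, 50),
    "unknown": ChunkDefaults("metadata-only", None, None),
}


def suggest(content_type: str) -> ChunkDefaults:
    """Return the default for a detected type, falling back to metadata-only."""
    return DEFAULTS.get(content_type, DEFAULTS["unknown"])
```

The returned values are only a starting point: in the upload flow they pre-fill the dialog and stay editable until you commit.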

What each strategy does.

Semantic

Splits on natural boundaries — paragraphs, sections, headings — then merges short adjacent pieces until each chunk is close to the target token count. Best for documents where meaning sits inside section boundaries: scientific papers, contracts, policy manuals.
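
The split-then-merge idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: paragraphs are detected by blank lines only, and `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of using a real tokenizer.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokenizer tokens.
    return len(text.split())


def semantic_chunks(text: str, target_tokens: int = 512) -> list[str]:
    """Split on blank lines, then greedily merge adjacent paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if adding this paragraph would overshoot.
        if current and current_len + n > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In this sketch a paragraph longer than the target is kept whole rather than cut, so a boundary never falls mid-paragraph; a production splitter would need a fallback for very long sections.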

Fixed

Cuts at the target token count regardless of structure, with overlap to avoid losing meaning across boundaries. Best when the input has no useful structure — long plain-text logs, transcripts without speaker turns, OCR'd images where layout is gone.
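
A fixed-size splitter with overlap is short enough to show in full. The sketch below assumes the input is already tokenized into a list; the function name `fixed_chunks` is hypothetical.

```python
def fixed_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Cut a token list into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With the defaults, each chunk starts 462 tokens after the previous one, so a sentence that straddles a cut appears intact in at least one of the two neighbouring chunks as long as it is shorter than the overlap.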

Function-boundary

For source code. Splits at function / class / method boundaries, with the function signature carried into each chunk so retrieval can match "where is the validate_input function" to the actual implementation.
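
For Python input, the standard-library `ast` module is one way to find those boundaries. The sketch below is an assumption about how such a splitter could look, not the platform's code; `function_chunks` and the chunk field names are hypothetical, and other languages would need their own parsers.

```python
import ast


def function_chunks(source: str) -> list[dict]:
    """One chunk per function or method, with its signature line kept as metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[dict] = []

    def visit(node: ast.AST, prefix: str = "") -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                chunks.append({
                    "symbol": prefix + child.name,
                    # First line of the definition; decorators sit on earlier lines.
                    "signature": lines[child.lineno - 1].strip(),
                    "text": ast.get_source_segment(source, child),
                })
            elif isinstance(child, ast.ClassDef):
                # Descend into classes so methods become their own chunks.
                visit(child, prefix + child.name + ".")

    visit(tree)
    return chunks
```

Functions nested inside other functions stay part of their parent's chunk here, and a signature that spans several lines is recorded by its first line only.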

Per-row / per-cluster

For tabular data. Per-row treats each row as a chunk; per-cluster groups rows by similarity (handy for sensor data where 1000 rows might be one "operating regime"). Column statistics ride along as metadata so queries against columns work.
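
The per-row half of this can be sketched with the standard `csv` module. The chunk layout and the name `row_chunks` are hypothetical, the only statistics computed are min and max for fully numeric columns, and per-cluster grouping is left out because it depends on the similarity measure chosen.

```python
import csv
import io


def row_chunks(csv_text: str) -> list[dict]:
    """One chunk per row; min/max of numeric columns ride along as metadata."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats: dict = {}
    for column in (rows[0].keys() if rows else []):
        try:
            values = [float(r[column]) for r in rows]
        except (TypeError, ValueError):
            continue  # non-numeric column: no stats in this sketch
        stats[column] = {"min": min(values), "max": max(values)}
    return [
        {
            "text": ", ".join(f"{k}={v}" for k, v in row.items()),
            "metadata": {"row": i, "column_stats": stats},
        }
        for i, row in enumerate(rows)
    ]
```

Attaching the same column statistics to every row chunk is wasteful but simple; a real index would more likely store them once per file and reference them.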

Per-utterance / per-scene

For audio and video. The media worker transcribes / segments first, then each utterance (for audio) or scene (for video) becomes a chunk with the transcript and timestamps attached. Lets you query "who said X around the 12-minute mark" and get an answer that points at the right scene.
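
As a rough illustration of the audio case, the sketch below turns speech-to-text segments into utterance chunks and looks them up by time. The `Segment` shape is a hypothetical stand-in for whatever the media worker emits, and merging consecutive segments from the same speaker is an assumption about what counts as one utterance.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def utterance_chunks(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into one utterance chunk."""
    chunks: list[Segment] = []
    for seg in segments:
        if chunks and chunks[-1].speaker == seg.speaker:
            last = chunks[-1]
            chunks[-1] = Segment(last.speaker, last.start, seg.end, f"{last.text} {seg.text}")
        else:
            chunks.append(seg)
    return chunks


def at_time(chunks: list[Segment], seconds: float) -> list[Segment]:
    """Return the chunks whose time span covers the given moment."""
    return [c for c in chunks if c.start <= seconds <= c.end]
```

Because every chunk keeps its start and end time, a hit from the vector index can be mapped straight back to a position in the recording, for example `at_time(chunks, 12 * 60)` for the 12-minute mark.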

Per-window

For sensor time-series. Slides a window across the stream, summarises each window into a chunk with anomaly tags + trend direction + value range. Good for IoT and SCADA data where the structure is "5 minutes of one shift", "5 minutes of another shift", and you want to query against operating modes.
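
A sliding window with the default sizes (5 minutes, 30 seconds of overlap) might look like the sketch below. It assumes samples arrive as `(timestamp_seconds, value)` pairs sorted by time; the name `window_chunks` and the summary fields are hypothetical, the trend is a simple first-versus-last comparison, and anomaly tagging is left out.

```python
def window_chunks(samples, window_s=300.0, overlap_s=30.0):
    """Summarise (timestamp_seconds, value) samples into overlapping windows."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("need 0 <= overlap_s < window_s")
    if not samples:
        return []
    step = window_s - overlap_s
    start = samples[0][0]
    last_ts = samples[-1][0]
    chunks = []
    while start <= last_ts:
        values = [v for t, v in samples if start <= t < start + window_s]
        if values:  # skip windows that fall into a gap in the stream
            delta = values[-1] - values[0]
            trend = "rising" if delta > 0 else "falling" if delta < 0 else "flat"
            chunks.append({
                "start": start,
                "end": start + window_s,
                "min": min(values),
                "max": max(values),
                "trend": trend,
            })
        start += step
    return chunks
```

Rescanning all samples for every window keeps the sketch short; for long streams a single pass with a moving index would be the more usual choice.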

Metadata-only

Fallback for content types the platform can't extract text from — binaries, encrypted archives, raw firmware images. Stores filename, size, magic-byte signature and an LLM-generated description of the file's role, so the file is searchable even when its contents aren't.
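
The non-LLM part of such a record is easy to sketch. The field names and the small `MAGIC` table below are hypothetical; a real detector would recognise far more signatures, and the generated description is omitted.

```python
from pathlib import Path

# A few well-known file signatures; a real detector would know many more.
MAGIC = {
    b"\x7fELF": "ELF executable",
    b"PK\x03\x04": "ZIP archive",
    b"%PDF": "PDF document",
    b"\x1f\x8b": "gzip stream",
}


def metadata_record(path: Path) -> dict:
    """Describe a file by name, size and leading bytes without reading its contents."""
    with path.open("rb") as fh:
        head = fh.read(8)
    kind = next((name for sig, name in MAGIC.items() if head.startswith(sig)), "unknown")
    return {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "magic_hex": head.hex(),
        "detected": kind,
    }
```

Only the first eight bytes are read, so the record costs the same to build for a 2 KB file as for a multi-gigabyte firmware image.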

Per-knowledge-base config

Picking a strategy across the whole KB.

Suggestion-and-confirm is the per-file flow. For a knowledge base where every document is the same shape — for instance, a KB of clinical-guideline PDFs — set the strategy once at the KB level:

curl -X POST -H "X-API-Key: $ELDRIC_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"chunk_size":512,"chunk_overlap":50,"strategy":"semantic"}' \
     "https://<your-host>/api/v1/vector/namespaces/<tenant>/<ns>/config"

From this point on, uploads into that KB skip the suggestion dialog (or show it with the KB defaults pre-filled). Re-embedding the existing documents after a strategy change is a one-button operation in the admin console.
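
If you script KB setup rather than calling curl by hand, the same request can be built with Python's standard library. This is a sketch under assumptions: the endpoint path, header and JSON fields are taken from the curl example, while the function name `build_config_request` and the example host, tenant and namespace values are hypothetical. The response format is not described here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_config_request(host, tenant, namespace, api_key,
                         strategy="semantic", chunk_size=512, chunk_overlap=50):
    """Build (but do not send) the KB-level chunking config request."""
    body = json.dumps({
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "strategy": strategy,
    }).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}/api/v1/vector/namespaces/{tenant}/{namespace}/config",
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


# To send it, pass the request to urllib.request.urlopen(...).
```

Keeping request construction separate from sending makes it easy to log or review the exact payload before it changes how a whole knowledge base is chunked.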

Why this matters

Better chunks, better retrieval.

A generic 512-token chunk with 50-token overlap works across everything, but it works worst on content where the meaningful unit is something else: a sentence in a contract, a function in code, an utterance in an interview, a row in a CSV. Content-aware chunking is the single change that takes RAG from "sometimes finds the right thing" to "reliably finds the right thing", because the units stored in the vector index match the units a query is actually about.

Combined with the EMM compressed-retrieval preview and the smart memory inference preview, the chunking layer is the foundation of Eldric's retention loop: high-quality chunks at ingest → better RAG hits → better user acceptance signal → more useful training corpus → smarter platform over time.
