Chunking strategies

Different content, different chunks.

A scientific paper isn't the same shape as a sensor stream. A CSV isn't the same shape as a Python source file. A 90-minute video isn't the same shape as a one-page memo. Eldric ships content-aware chunking: the platform detects what you uploaded, picks an appropriate strategy by default, and lets you adjust before committing. Better chunks at ingest mean better RAG hits at query time.


Intelligent upload

Eldric suggests; you confirm.

When you drag a file into the chat shell or a knowledge-base management page, the upload flow doesn't immediately commit. The platform inspects the file first — content type, language, length, structure — then opens a suggestion dialog with parameters pre-filled. You see what's about to happen, you can override anything, and only when you click Commit to RAG does the ingestion actually fire.

What the suggestion dialog shows you, per file:

Click Preview chunks to see the first 5–10 chunks the strategy would produce. Adjust the strategy and re-preview to compare. Click Commit to RAG when satisfied.


The defaults

Suggested strategy per content type.

The defaults are listed below. Every value can be overridden in the upload dialog and persisted per knowledge base.

Content type | Strategy | Chunk size | Overlap | Auto-enrichment
Scientific PDF | semantic (per-section) | 512 tokens | 50 | authors, DOI, refs, entities
Markdown / docs | semantic (heading-boundary) | 384 tokens | 40 | headings, code blocks, cross-links
Code (Python, C++, JS, …) | function-boundary | 1024 tokens | 100 | symbols, imports, docstrings
CSV / TSV | per-row or per-cluster | row-natural | 0 | column stats, value distributions
Audio | per-utterance after STT | n/a | n/a | transcript, speaker diarization, timestamps
Video | per-scene after scene detection | n/a | n/a | scene detection, frame samples, transcript
Image | per-image | n/a | n/a | vision embedding, description, OCR text
Sensor time-series | per-window | 5 minutes | 30 seconds | anomaly tags, trend direction, range
Genomic FASTA | per-sequence | n/a | n/a | gene annotation, GC content, ORF
Chemical SMILES | per-molecule | n/a | n/a | properties, ADMET, similar compounds
Plain text | fixed | 512 tokens | 50 | language, keyword extraction
Binary / unknown | metadata-only | n/a | n/a | filename, size, magic bytes, LLM description
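
To make the suggest-then-override idea concrete, here is a minimal sketch of a defaults lookup. The strategy names and numbers are copied from a few rows of the table; everything else (the `ChunkDefaults` type, the `suggest_defaults` function, and detection by file extension) is a hypothetical illustration. The platform itself is described as inspecting content type, language, length and structure, not just the filename.

```python
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ChunkDefaults:
    strategy: str
    chunk_size: Optional[int]  # tokens; None where the table gives no token count
    overlap: Optional[int]     # None where the table says n/a


# A subset of the table's rows. The keys are illustrative labels, not Eldric identifiers.
DEFAULTS = {
    "markdown": ChunkDefaults("semantic", 384, 40),
    "code": ChunkDefaults("function-boundary", 1024, 100),
    "csv": ChunkDefaults("per-row", None, 0),
    "plain_text": ChunkDefaults("fixed", 512, 50),
    "unknown": ChunkDefaults("metadata-only", None, None),
}

# Assumption: extension-based detection stands in for real content inspection.
EXTENSION_TO_TYPE = {
    ".md": "markdown",
    ".py": "code",
    ".cpp": "code",
    ".js": "code",
    ".csv": "csv",
    ".tsv": "csv",
    ".txt": "plain_text",
}


def suggest_defaults(filename: str, overrides: Optional[dict] = None) -> ChunkDefaults:
    """Return the suggested defaults for a file, with any user overrides applied."""
    content_type = EXTENSION_TO_TYPE.get(Path(filename).suffix.lower(), "unknown")
    return replace(DEFAULTS[content_type], **(overrides or {}))
```

Calling `suggest_defaults("notes.md")` yields the Markdown defaults, and passing `{"chunk_size": 256}` as overrides mirrors adjusting a value in the dialog before committing.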

Strategies explained

What each strategy does.

Semantic

Splits on natural boundaries — paragraphs, sections, headings — then merges short adjacent pieces until each chunk is close to the target token count. Best for documents where meaning sits inside section boundaries: scientific papers, contracts, policy manuals.
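
The split-then-merge behaviour can be sketched in a few lines. This is an illustration under simplifying assumptions, not Eldric's implementation: paragraphs (blank-line separated) are the only boundary considered, and a whitespace word count stands in for a real tokenizer. The function names are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def semantic_chunks(text: str, target_tokens: int = 512) -> list[str]:
    """Split on blank lines, then merge adjacent paragraphs up to the target size."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if adding this paragraph would overshoot the target.
        if current and current_tokens + n > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than the target is kept whole here; a fuller version would fall back to sentence-level splitting for such cases.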

Fixed

Cuts at the target token count regardless of structure, with overlap to avoid losing meaning across boundaries. Best when the input has no useful structure — long plain-text logs, transcripts without speaker turns, OCR'd images where layout is gone.
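
A minimal sketch of fixed-size chunking with overlap, assuming the input has already been tokenized into a list (the tokenizer itself is out of scope, and `fixed_chunks` is a hypothetical name). Each chunk starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next.

```python
def fixed_chunks(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Cut a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing chunk of pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```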

Function-boundary

For source code. Splits at function / class / method boundaries, with the function signature carried into each chunk so retrieval can match "where is the validate_input function" to the actual implementation.
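
For Python input, the idea can be sketched with the standard-library `ast` module. This is an illustration only, with several simplifications that are assumptions rather than Eldric behaviour: it handles top-level definitions only, measures size in lines rather than tokens, and treats the first line of a definition as its signature. When a definition is longer than `max_lines`, every continuation piece is prefixed with that signature line.

```python
import ast


def function_chunks(source: str, max_lines: int = 40) -> list[str]:
    """Split Python source at top-level function/class boundaries."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # imports, constants, etc. are skipped in this sketch
        body_lines = lines[node.lineno - 1:node.end_lineno]
        signature = body_lines[0]  # simplification: first line of the definition
        for start in range(0, len(body_lines), max_lines):
            piece = body_lines[start:start + max_lines]
            if start > 0:
                # Carry the signature into continuation pieces so retrieval can match it.
                piece = [signature] + piece
            chunks.append("\n".join(piece))
    return chunks
```

Because each piece begins with the definition line, a query mentioning `validate_input` can match any part of that function, not only its first chunk.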

Per-row / per-cluster

For tabular data. Per-row treats each row as a chunk; per-cluster groups rows by similarity (handy for sensor data where 1000 rows might be one "operating regime"). Column statistics ride along as metadata so queries against columns work.
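
The per-row variant, with column statistics attached as metadata, might look like the sketch below. The chunk layout (`text` plus a `metadata` dict) and the choice of min, max and mean are assumptions made for illustration; per-cluster grouping is not shown.

```python
import csv
import io
import statistics


def per_row_chunks(csv_text: str) -> list[dict]:
    """Turn each CSV row into a chunk and attach numeric column statistics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    stats: dict[str, dict[str, float]] = {}
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # non-numeric or incomplete column: no stats in this sketch
        stats[column] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
        }
    return [
        {
            "text": ", ".join(f"{key}={value}" for key, value in row.items()),
            "metadata": {"row": index, "column_stats": stats},
        }
        for index, row in enumerate(rows)
    ]
```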

Per-utterance / per-scene

For audio and video. The media worker transcribes / segments first, then each utterance (for audio) or scene (for video) becomes a chunk with the transcript and timestamps attached. Lets you query "who said X around the 12-minute mark" and get an answer that points at the right scene.
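
As a rough sketch of what a per-utterance chunk could carry, assume the speech-to-text step has already produced utterances with a speaker label and start/end times. The `Utterance` type, the chunk layout and the `chunks_near` helper are all hypothetical; they only illustrate why keeping timestamps on each chunk makes time-anchored questions answerable.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start_s: float  # seconds from the start of the recording
    end_s: float
    text: str


def utterance_chunks(utterances: list[Utterance]) -> list[dict]:
    """One chunk per utterance, with speaker and timestamps kept as metadata."""
    return [
        {
            "text": f"{u.speaker}: {u.text}",
            "metadata": {"speaker": u.speaker, "start_s": u.start_s, "end_s": u.end_s},
        }
        for u in utterances
    ]


def chunks_near(chunks: list[dict], at_s: float, tolerance_s: float = 60.0) -> list[dict]:
    """Return chunks whose time span lies within `tolerance_s` of a given moment."""
    return [
        c for c in chunks
        if c["metadata"]["start_s"] - tolerance_s <= at_s <= c["metadata"]["end_s"] + tolerance_s
    ]
```

Filtering with `at_s=720` narrows a search to utterances around the 12-minute mark before any text matching is applied.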

Per-window

For sensor time-series. Slides a window across the stream and summarises each window into a chunk with anomaly tags, trend direction and value range. Good for IoT and SCADA data where the structure is "5 minutes of one shift", "5 minutes of another shift", and you want to query against operating modes.
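
A sliding window with overlap can be sketched as follows, using the 5-minute window and 30-second overlap from the defaults. The summary fields and the trend rule (comparing the last value in the window with the first) are assumptions for illustration, and anomaly tagging is left out.

```python
def window_chunks(
    samples: list[tuple[float, float]],
    window_s: float = 300.0,
    overlap_s: float = 30.0,
) -> list[dict]:
    """Summarise (timestamp_s, value) samples, sorted by time, into overlapping windows."""
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    if not samples:
        return []
    step = window_s - overlap_s
    start = samples[0][0]
    last = samples[-1][0]
    chunks: list[dict] = []
    while True:
        values = [v for t, v in samples if start <= t < start + window_s]
        if values:
            delta = values[-1] - values[0]
            trend = "rising" if delta > 0 else "falling" if delta < 0 else "flat"
            chunks.append({
                "start_s": start,
                "end_s": start + window_s,
                "min": min(values),
                "max": max(values),
                "trend": trend,
            })
        # Stop once the current window already covers the final sample.
        if start + window_s > last:
            break
        start += step
    return chunks
```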

Metadata-only

Fallback for content types the platform can't extract text from — binaries, encrypted archives, raw firmware images. Stores filename, size, magic-byte signature and an LLM-generated description of the file's role, so the file is searchable even when its contents aren't.
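
A minimal sketch of the metadata-only record, assuming a small lookup of well-known magic-byte prefixes. The signature table covers only four common formats, the record layout is hypothetical, and the LLM-generated description is omitted.

```python
# Well-known file signatures; a real detector would know many more.
MAGIC_SIGNATURES = {
    b"\x7fELF": "ELF executable",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip stream",
    b"%PDF": "PDF document",
}


def metadata_record(filename: str, data: bytes) -> dict:
    """Build a searchable record from filename, size and leading magic bytes."""
    signature = next(
        (label for magic, label in MAGIC_SIGNATURES.items() if data.startswith(magic)),
        "unknown",
    )
    return {
        "filename": filename,
        "size_bytes": len(data),
        "magic_hex": data[:8].hex(),  # first bytes kept verbatim for later inspection
        "signature": signature,
    }
```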


Per-knowledge-base config

Picking a strategy across the whole KB.

Suggestion-and-confirm is the per-file flow. For a knowledge base where every document is the same shape — for instance, a KB of clinical-guideline PDFs — set the strategy once at the KB level:

curl -X POST -H "X-API-Key: $ELDRIC_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"chunk_size":512,"chunk_overlap":50,"strategy":"semantic"}' \
     https://<your-host>/api/v1/vector/namespaces/<tenant>/<ns>/config

From this point on, uploads into that KB skip the suggestion dialog (or show it with the KB defaults pre-filled). Re-embedding the existing documents after a strategy change is a one-button operation in the admin console.


Why this matters

Better chunks, better retrieval.

A generic 512-token chunk with 50-token overlap works across everything, but it works worst on content where the meaningful unit is something else: a sentence in a contract, a function in code, an utterance in an interview, a row in a CSV. Content-aware chunking is the single change that takes RAG from "sometimes finds the right thing" to "reliably finds the right thing", because the units stored in the vector index match the units a query is actually about.

Combined with the EMM compressed-retrieval preview and the smart memory inference preview, the chunking layer is the foundation of Eldric's retention loop: high-quality chunks at ingest → better RAG hits → better user acceptance signal → more useful training corpus → smarter platform over time.


Going further

Next.

For the customer-facing how-to: using RAG. For the architecture view: RAG architecture. For the cascade behaviour (ENRN → EMM → RAG → live source): RAG on demand. For custom classification — teaching Eldric your own intent classes — see custom classification.