Preview in 5.0.x

Ingest that reads the schema.

The intelligent upload dialog gains a schema-aware ingest layer. For structured content (tables, CSVs, JSON, XML), the wizard reads the schema before chunking and proposes a strategy that respects record boundaries instead of slicing through them. Structured data no longer needs per-document overrides to ingest cleanly.

Next patch


Where today's chunking falls short

Records cut in half.

The 5.0 knowledge-base ingest path applies the same chunking strategy to every document. For prose — PDFs, reports, transcripts — that works well. For structured content, it often doesn't.

A 50-column CSV with 10,000 rows, chunked by token count, produces chunks that contain partial rows: the first half of row 1,247 ends one chunk, the second half starts the next. The model can recall “something about row 1,247” without ever seeing the whole row in one chunk. The same problem hits JSON documents (records split mid-object) and XML feeds (elements truncated).
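The effect is easy to reproduce outside the product. The following is a minimal sketch, not the 5.0 ingest code: it builds a small synthetic CSV, cuts it into fixed-size pieces, and counts how many cuts land in the middle of a row. Characters stand in for tokens, and the helper names (`make_csv`, `chunk_by_size`, `count_split_rows`) are hypothetical, chosen only for this illustration.

```python
import csv
import io


def make_csv(n_rows: int, n_cols: int) -> str:
    """Build a small synthetic CSV: one header line plus n_rows data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([f"col{c}" for c in range(n_cols)])
    for r in range(n_rows):
        writer.writerow([f"r{r}c{c}" for c in range(n_cols)])
    return buf.getvalue()


def chunk_by_size(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking. Characters stand in for tokens (assumption)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def count_split_rows(chunks: list[str]) -> int:
    """Count chunk boundaries that fall inside a row instead of on a newline."""
    return sum(1 for chunk in chunks[:-1] if not chunk.endswith("\n"))


text = make_csv(n_rows=100, n_cols=5)
chunks = chunk_by_size(text, size=64)
split = count_split_rows(chunks)
print(f"{len(chunks)} chunks, {split} boundaries fall mid-row")
```

With a fixed chunk size and rows of varying length, a boundary lands on a row end only by coincidence, so most cuts leave a partial row on each side.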

The 5.0 workaround is a per-document chunking override. Power users set it; most customers don't, and end up with degraded recall on their structured data.


What's coming in 5.0.x

Schema first, chunk second.


What's pending

Honest gates on this page.

Still in flight

  • Schema sniffers for CSV / TSV / JSON / NDJSON / XML / Parquet
  • Record-boundary chunking implementations per format
  • Per-chunk metadata schema (column names, types, source row index)
  • Wizard UI proposal step with override controls
  • Migration path for existing knowledge bases (admin opt-in re-ingest)

This page updates as each piece lands. The release notes remain the authoritative record of what ships.
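To make the first three in-flight items concrete, here is a rough sketch for the CSV case only, under stated assumptions: a simple type sniffer, chunking that only ever cuts between rows, and per-chunk metadata carrying column names, inferred types, and source row indices. The names (`Chunk`, `infer_type`, `chunk_csv`), the `max_chars` budget, and the choice to repeat the header in every chunk are all hypothetical. They illustrate the idea and do not describe the shipping implementation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str               # header line plus whole rows only
    columns: list[str]      # column names taken from the header
    types: dict[str, str]   # inferred type per column
    first_row: int          # 1-based index of the first source data row
    last_row: int           # 1-based index of the last source data row


def infer_type(values: list[str]) -> str:
    """Return 'int', 'float', or 'str' depending on what every value parses as."""
    if not values:
        return "str"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def chunk_csv(text: str, max_chars: int) -> list[Chunk]:
    """Chunk CSV text on row boundaries; assumes a header and rectangular rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [row for row in reader if row]
    types = {col: infer_type([r[i] for r in rows]) for i, col in enumerate(header)}

    def render(batch: list[list[str]]) -> str:
        # Repeat the header so each chunk is readable on its own.
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(batch)
        return buf.getvalue()

    chunks: list[Chunk] = []
    batch: list[list[str]] = []
    start = 1
    for idx, row in enumerate(rows, start=1):
        # Close the current chunk before a row that would push it over budget.
        if batch and len(render(batch + [row])) > max_chars:
            chunks.append(Chunk(render(batch), header, types, start, idx - 1))
            batch, start = [], idx
        batch.append(row)
    if batch:
        chunks.append(Chunk(render(batch), header, types, start, len(rows)))
    return chunks


sample = "id,name,score\n1,ada,9.5\n2,grace,8.0\n3,alan,7.25\n4,edsger,6.0\n"
chunks = chunk_csv(sample, max_chars=40)
for c in chunks:
    print(c.first_row, c.last_row, c.types)
```

In this sketch a single row larger than the budget still becomes its own oversized chunk rather than being split, which keeps the record-boundary guarantee at the cost of a hard size limit.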


Read next.

For RAG today, see using RAG and chunking strategies. For the full 5.0.x roadmap, see what's next in 5.0.x.