Feb 15, 2026 · 4 min read

RAG Quality Is Chunking Quality

Before you tune prompts or swap embedding models, look at how you split documents. It is probably the whole problem.

Every RAG debugging session I have run for TypeFlow AI ends in the same place: the retriever was fine, the model was fine, the chunks were garbage.

The failure mode

Fixed-size token windows cut arguments in half. The embedding of half an argument points somewhere useless in vector space, so retrieval surfaces a chunk that is lexically related and semantically broken. The model then does what models do with broken context: it improvises.
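To make the failure concrete, here is a minimal sketch of a fixed-size splitter. It assumes whitespace-separated words as a stand-in for real tokenizer output, and the sample text and the window size of 8 are made up for illustration.

```python
def fixed_windows(tokens, size):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Hypothetical sample text: a claim followed by its counter-argument.
text = (
    "Caching helps read-heavy workloads. "
    "However, it hurts when writes dominate because invalidation gets expensive."
)

# Assumption: whitespace words approximate tokens; a real pipeline would use
# the embedding model's tokenizer.
tokens = text.split()
windows = [" ".join(w) for w in fixed_windows(tokens, 8)]

for w in windows:
    print(repr(w))
```

The first window ends at "it hurts when" and the second starts at "writes dominate", so neither chunk carries the full counter-argument on its own.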

What worked

Splitting on semantic boundaries instead of token counts. Headings, list boundaries, and argument turns become chunk edges. Chunks vary wildly in size and that is fine; a coherent 800-token chunk beats two incoherent 400-token ones every time.
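A rough sketch of the idea, not the splitter I actually run: it assumes Markdown-style `#` headings and handles only that one boundary type. List boundaries and argument turns would need extra rules. The function name and sample document are hypothetical.

```python
import re

# Assumption: documents use Markdown-style headings ("# ", "## ", ...).
HEADING = re.compile(r"^#{1,6}\s")


def split_on_headings(text):
    """Start a new chunk at every heading; keep each heading with its body."""
    chunks, current = [], []
    for line in text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only blank lines.
    return [c for c in chunks if c]


doc = """# Caching
Caching helps read-heavy workloads.
However, it hurts when writes dominate.

# Indexing
- B-tree for range scans
- Hash for equality lookups
"""

chunks = split_on_headings(doc)
for c in chunks:
    print(len(c.split()), "words:", c.splitlines()[0])
```

Chunk sizes fall out of the document structure instead of a token budget, so the claim and its counter-argument stay in the same chunk.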

Two more compounding wins:

  • Prefilter cheaply, rank expensively. A keyword prefilter before pgvector similarity cut latency enough to make autocomplete viable.
  • Cite everything. When every answer shows its chunks, users debug your retrieval for you. The worst chunks get reported within days.
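Both wins can be sketched in a few lines. This is an in-memory stand-in for illustration only: the real ranking step ran on pgvector, whose queries are not shown here, and the chunk ids, texts, and two-dimensional vectors below are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_terms, query_vec, chunks, top_k=2):
    """Cheap keyword prefilter first, then rank only the survivors by similarity."""
    terms = {t.lower() for t in query_terms}
    # Prefilter: keep chunks that share at least one word with the query.
    candidates = [c for c in chunks if terms & set(c["text"].lower().split())]
    # Rank: the expensive similarity step runs on the reduced candidate set.
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks with toy embeddings.
chunks = [
    {"id": "doc1#caching", "text": "caching helps read-heavy workloads", "vec": [0.9, 0.1]},
    {"id": "doc1#indexing", "text": "hash indexes suit equality lookups", "vec": [0.2, 0.8]},
    {"id": "doc2#caching", "text": "caching hurts when writes dominate", "vec": [0.6, 0.4]},
]

hits = retrieve(["caching"], [1.0, 0.0], chunks)

# Cite everything: return the chunk ids alongside the answer.
citations = [h["id"] for h in hits]
print(citations)
```

Returning the ids with every answer is what lets a user point at the exact chunk that misled the model.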

The takeaway

Embedding model choice moved my answer quality a few percent. Chunking strategy moved it more than everything else combined.

  • RAG
  • Vector Search
  • Supabase