Build a production-grade Retrieval-Augmented Generation pipeline for Turkish documents.
15 min readUse recursive character splitting at 512 tokens with 10% overlap. For Turkish: avoid splitting mid-sentence — the language has long compound words that span chunk boundaries.
For Turkish: multilingual-e5-large outperforms text-embedding-3-small on Turkish retrieval benchmarks. Self-host via Sentence-Transformers for cost control.
Use pgvector on your Basefyio (Basefyio) instance — no additional infra. Enable HNSW index for <50ms retrieval at 1M vectors.
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
Combine dense (vector) and sparse (BM25/tsvector) retrieval. For Turkish, FTS with "turkish" dictionary handles morphological variants — büyümek matches büyüme, büyüdü, etc.
Cross-encoder reranking (bge-reranker-v2-m3) reduces hallucination by surfacing the most relevant chunks. Run reranker on top-20, pass top-5 to LLM.