Chunking
Package: elsai-utilities v0.2.0
Helper classes for chunking and converting documents in RAG and vector database ingestion pipelines.
Installation
bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-utilities/ elsai-utilities==0.2.0
Requirements: Python >= 3.9
DocumentChunker
Splits text into structured chunks using multiple strategies.
python
from elsai_utilities.splitters import DocumentChunker
chunker = DocumentChunker()
Strategies
Recursive character splitter
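To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size character window with overlap. It is an illustration only: `sliding_chunks` is a hypothetical helper, not part of elsai-utilities, and it does not attempt any recursive, separator-aware splitting the library may perform.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=5, chunk_overlap=2))
# ['abcde', 'defgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role overlap plays: context that straddles a boundary appears in both neighbours. With `DocumentChunker`, the call looks like this: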
python
chunks = chunker.split_by_characters(
text="Long document text here...",
chunk_size=500,
chunk_overlap=50,
)
Token-based splitter
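Here `chunk_size` and `chunk_overlap` count tokens rather than characters. The sketch below shows the idea with whitespace-separated words as stand-in tokens. That is an assumption made for illustration: real tokenizers usually split text into subword units, and the tokenizer behind `split_by_tokens` is not stated here. `window_tokens` is a hypothetical helper.

```python
def window_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Group tokens into chunks of chunk_size tokens sharing chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    tokens = text.split()  # stand-in tokenizer: one token per whitespace-separated word
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


print(window_tokens("the quick brown fox jumps over the lazy dog", chunk_size=4, chunk_overlap=1))
# ['the quick brown fox', 'fox jumps over the', 'the lazy dog']
```

Sizing by tokens keeps each chunk within a model's token budget regardless of word length. With `DocumentChunker`, the call looks like this: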
python
chunks = chunker.split_by_tokens(
text="Long document text here...",
chunk_size=256,
chunk_overlap=20,
)
Markdown/header-aware splitter
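Header-aware splitting cuts a Markdown document at its headings so that each chunk stays within one section. A minimal sketch of the idea, assuming ATX-style `#` headings; `split_on_headers` and the shape of its return value are hypothetical and do not describe what `split_by_headers` returns.

```python
import re

# Matches ATX headings such as "# Title" or "### Detail" (levels 1 to 6).
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headers(markdown: str) -> list[dict]:
    """Return one section per heading, with the heading text, level, and body."""
    sections = []
    header, level, lines = None, 0, []

    def flush():
        body = "\n".join(lines).strip()
        # Skip the empty preamble that exists before the first heading.
        if header is not None or body:
            sections.append({"header": header, "level": level, "content": body})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the section that was open so far
            header, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()  # close the final section
    return sections


content = "# Section 1\n\nContent here.\n\n## Subsection\n\nMore content."
for section in split_on_headers(content):
    print(section)
# {'header': 'Section 1', 'level': 1, 'content': 'Content here.'}
# {'header': 'Subsection', 'level': 2, 'content': 'More content.'}
```

The sketch treats any line starting with `#` as a heading, so it would misread `#` comments inside fenced code. With `DocumentChunker`, the call looks like this: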
python
content = "# Section 1\n\nContent here.\n\n## Subsection\n\nMore content."
chunks = chunker.split_by_headers(content)
Semantic splitter
Groups sentences by meaning rather than size.
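One common way to do this is to embed each sentence and start a new chunk whenever the similarity between neighbouring sentences falls below a threshold. The sketch below shows that approach with hand-written toy vectors. Whether `split_by_semantics` works this way, and what exactly its `threshold` measures, is an assumption; `group_by_similarity` is a hypothetical helper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(sentences: list[str], vectors: list[list[float]], threshold: float) -> list[str]:
    """Merge consecutive sentences while neighbouring vectors stay similar enough."""
    if len(sentences) != len(vectors):
        raise ValueError("need exactly one vector per sentence")
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            groups[-1].append(sentences[i])  # same topic: extend the current chunk
        else:
            groups.append([sentences[i]])  # topic shift: start a new chunk
    return [" ".join(group) for group in groups]


sentences = ["Cats purr.", "Cats also meow.", "Stocks fell today."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings, not model output
print(group_by_similarity(sentences, vectors, threshold=0.8))
# ['Cats purr. Cats also meow.', 'Stocks fell today.']
```

In practice the vectors come from an embedding model, which is why the library call takes one as an argument: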
python
chunks = chunker.split_by_semantics(
text="Long document text here...",
embedding_model=your_embedding_model,
threshold=0.8,
)
DocumentConverter
Converts raw extracted text into structured document objects ready for vector ingestion.
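The pipeline example on this page reads `doc.page_content` and calls `doc.dict()`, which suggests that a document object pairs the chunk text with its metadata. A minimal sketch of that shape, using a hypothetical `SketchDocument` dataclass rather than the class `DocumentConverter` actually returns:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class SketchDocument:
    """Illustrative stand-in for a document object: text plus metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def dict(self) -> Dict[str, Any]:
        # Plain-dict form, convenient for merging with an embedding vector.
        return asdict(self)


doc = SketchDocument(
    page_content="Raw extracted text from a PDF...",
    metadata={"source": "report.pdf", "page": 1},
)
print(doc.dict())
# {'page_content': 'Raw extracted text from a PDF...', 'metadata': {'source': 'report.pdf', 'page': 1}}
```

Carrying the metadata alongside the text lets a retrieved chunk be traced back to its source file and page. With `DocumentConverter`, the conversion looks like this: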
python
from elsai_utilities.converters import DocumentConverter
converter = DocumentConverter()
documents = converter.convert(
text="Raw extracted text from a PDF...",
metadata={"source": "report.pdf", "page": 1},
)
Full RAG pipeline example
python
from elsai_utilities.splitters import DocumentChunker
from elsai_utilities.converters import DocumentConverter
from elsai_embeddings.azure_openai import AzureOpenAIEmbeddingModel
from elsai_vectordb.chromadb import ChromaVectorDb
# 1. Chunk the document (raw_text holds the text extracted from the source file)
chunker = DocumentChunker()
chunks = chunker.split_by_characters(raw_text, chunk_size=500, chunk_overlap=50)
# 2. Convert to document objects
converter = DocumentConverter()
documents = [converter.convert(chunk, metadata={"source": "report.pdf"}) for chunk in chunks]
# 3. Embed and store
embedding_model = AzureOpenAIEmbeddingModel(...)
db = ChromaVectorDb(persist_directory="./db")
db.create_if_not_exists("my_collection")
for doc in documents:
    vector = embedding_model.embed_query(doc.page_content)
    db.add_document({**doc.dict(), "embeddings": vector}, collection_name="my_collection")