Elsai Text Extractors
Package: elsai-text-extractors v0.1.0
Extracts text content from PDF, DOCX, CSV, Excel, and other document formats. Each extractor takes the file path at construction time and exposes a dedicated extraction method.
For LLM-powered natural language querying over Excel files, see ExcelParser (separate package).
Installation
```bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-text-extractors/ elsai-text-extractors==0.1.0
```

Requirements: Python >= 3.9
Available extractors
| Class | Package | Input format | Method |
|---|---|---|---|
| CSVFileExtractor | elsai-text-extractors | CSV | load_from_csv() |
| DocxTextExtractor | elsai-text-extractors | DOCX | extract_text_from_docx() |
| PyPDFTextExtractor | elsai-text-extractors | PDF | extract_text_from_pdf() |
| UnstructuredExcelLoaderService | elsai-text-extractors | XLSX | load_excel() |
| MarkItDownTextExtractor | elsai-text-extractors | PDF, DOCX, PPTX, and more | extract_text_from_file() |
| ExcelParser | elsai-parsers | XLSX (LLM-powered) | parse() |
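
The table above pairs each input format with a class and a method, so routing a file to the right extractor can be done with a lookup on the file suffix. The following is a minimal sketch, not part of the package: the names `EXTRACTORS`, `FALLBACK`, `pick_extractor`, and `extract_text` are hypothetical, and the module paths are copied from the import lines used in the per-class examples on this page. Sending every unlisted suffix to `MarkItDownTextExtractor` is an assumption.

```python
import importlib
from pathlib import Path

# Suffix -> (module, class, method), copied from this page's examples.
EXTRACTORS = {
    ".csv": ("elsai_text_extractors.csv_extractor", "CSVFileExtractor", "load_from_csv"),
    ".docx": ("elsai_text_extractors.docx_extractor", "DocxTextExtractor", "extract_text_from_docx"),
    ".pdf": ("elsai_text_extractors.pypdfloader", "PyPDFTextExtractor", "extract_text_from_pdf"),
    ".xlsx": (
        "elsai_text_extractors.unstructured_excel_loader",
        "UnstructuredExcelLoaderService",
        "load_excel",
    ),
}

# Assumption: anything not listed above is handed to the MarkItDown extractor.
FALLBACK = ("elsai_text_extractors.markitdown", "MarkItDownTextExtractor", "extract_text_from_file")


def pick_extractor(file_path):
    """Return the (module, class, method) triple for a path, based on its suffix."""
    suffix = Path(file_path).suffix.lower()
    return EXTRACTORS.get(suffix, FALLBACK)


def extract_text(file_path):
    """Import the chosen extractor lazily, construct it with file_path, and run it."""
    module_name, class_name, method_name = pick_extractor(file_path)
    module = importlib.import_module(module_name)  # needs elsai-text-extractors installed
    extractor = getattr(module, class_name)(file_path=str(file_path))
    return getattr(extractor, method_name)()
```

Because the import happens inside `extract_text`, the lookup itself can be checked without the package installed. Note that `.xlsx` is routed to the raw-text loader here, not to `ExcelParser`, which lives in a separate package and needs an LLM configuration.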
CSVFileExtractor
Loads and reads content from CSV files. The file path is passed at construction time.
```python
from elsai_text_extractors.csv_extractor import CSVFileExtractor

extractor = CSVFileExtractor(file_path="sample_data/sample.csv")
content = extractor.load_from_csv()
print(content)
```

Constructor parameters:

| Parameter | Description |
|---|---|
| file_path | Path to the CSV file |
DocxTextExtractor
Extracts plain text from Microsoft Word (.docx) files.
```python
from elsai_text_extractors.docx_extractor import DocxTextExtractor

extractor = DocxTextExtractor(file_path="sample_data/sample.docx")
content = extractor.extract_text_from_docx()
print(content)
```

Constructor parameters:

| Parameter | Description |
|---|---|
| file_path | Path to the .docx file |
PyPDFTextExtractor
Extracts text from PDF files using PyPDF.
```python
from elsai_text_extractors.pypdfloader import PyPDFTextExtractor

extractor = PyPDFTextExtractor(file_path="sample_data/sample.pdf")
content = extractor.extract_text_from_pdf()
print(content)
```

Constructor parameters:

| Parameter | Description |
|---|---|
| file_path | Path to the PDF file |
UnstructuredExcelLoaderService
Loads content from Excel (.xlsx) files using the Unstructured library.
```python
from elsai_text_extractors.unstructured_excel_loader import UnstructuredExcelLoaderService

extractor = UnstructuredExcelLoaderService(file_path="sample_data/sample.xlsx")
content = extractor.load_excel()
print(content)
```

Constructor parameters:

| Parameter | Description |
|---|---|
| file_path | Path to the .xlsx file |
MarkItDownTextExtractor
Converts a wide range of document formats into clean Markdown text using the MarkItDown library. Useful when you need consistent text output across mixed document types.
Supported formats: PDF, DOCX, PPTX, XLSX, and other formats supported by MarkItDown.
```python
from elsai_text_extractors.markitdown import MarkItDownTextExtractor

extractor = MarkItDownTextExtractor(file_path="/path/to/document.pdf")
content = extractor.extract_text_from_file()
print(content)
```

Constructor parameters:

| Parameter | Description |
|---|---|
| file_path | Path to the document file |
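
Since this extractor is aimed at mixed document types, a common pattern is to run it over a whole folder. The sketch below is one way to do that, under stated assumptions: `SUFFIXES`, `extract_folder`, and `markitdown_extract` are hypothetical helper names, and the suffix set only covers the four formats named above (MarkItDown may accept more). The per-file extraction function is passed in as an argument so the folder walk can be exercised without the package installed.

```python
from pathlib import Path

# Suffixes taken from the "Supported formats" line above; extend as needed.
SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def extract_folder(folder, extract):
    """Apply extract(path) to each supported file in folder; return {filename: text}."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in SUFFIXES:
            results[path.name] = extract(str(path))
    return results


def markitdown_extract(path):
    """Extract one file with MarkItDownTextExtractor (needs elsai-text-extractors)."""
    from elsai_text_extractors.markitdown import MarkItDownTextExtractor

    return MarkItDownTextExtractor(file_path=path).extract_text_from_file()
```

With the package installed, `extract_folder("docs/", markitdown_extract)` would return one Markdown string per supported file in `docs/`.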
ExcelParser
Package: elsai-parsers v0.1.0
Queries Excel files using natural language. Unlike the extractors above, which return raw text, ExcelParser indexes the Excel data into a vector database and uses an LLM to answer questions about it. It can also return structured JSON responses.
```bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-parsers/ elsai-parsers==0.1.0
```

```python
from elsai_parsers.excel_parser import ExcelParser
from elsai_model.azure_openai import AzureOpenAIConnector
from elsai_embeddings.azure_embeddings import AzureOpenAIEmbeddingModel
from elsai_vectordb.chromadb import ChromaVectorDb

# Constructor arguments are elided here; supply your own configuration.
llm = AzureOpenAIConnector(...)
embedding_model = AzureOpenAIEmbeddingModel(...)
vector_db = ChromaVectorDb(persist_directory="./excel_db")

parser = ExcelParser(
    file_path="data/sales_report.xlsx",
    config={
        "vector_database": vector_db,
        "embedding_function": embedding_model,
        "llm": llm,
        "vector_store_type": "chroma",
    },
)

# Natural language query: returns plain text
result = parser.parse(user_prompt="What is the total revenue for Q3?")
print(result)

# With a JSON template: returns a structured dict
result = parser.parse(
    user_prompt="List the top 5 products by sales.",
    json_template='{"products": [{"name": "", "sales": 0}]}',
)
print(result)
```

Constructor parameters:

| Parameter | Description |
|---|---|
| file_path | Path to the Excel (.xlsx) file |
| config | Configuration dictionary (see below) |
config dictionary keys:
| Key | Description |
|---|---|
| vector_database | Vector database instance for indexing Excel content (e.g. ChromaVectorDb) |
| embedding_function | Embedding model instance for vectorizing content |
| llm | LLM connector instance for answering queries |
| vector_store_type | String identifier for the vector store type (e.g. "chroma") |
parse() parameters:
| Parameter | Description |
|---|---|
| user_prompt | Natural language question about the Excel data |
| json_template | Optional JSON string template. When provided, the response is structured to match the template schema |
Returns a plain text string when json_template is omitted, or a structured dict when a template is provided.
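
Because `json_template` is a JSON string, it can be less error-prone to build it from a Python dict than to hand-write the quoting. The sketch below uses only the standard library; `template` mirrors the example above, and `normalize_result` is a hypothetical defensive helper that assumes the caller wants a dict even if a call comes back as text. It is not part of elsai-parsers.

```python
import json

# Template shape mirrors the example above; the field names are illustrative.
template = {"products": [{"name": "", "sales": 0}]}
json_template = json.dumps(template)


def normalize_result(result):
    """Return a dict whether the result is a dict, a JSON string, or plain text."""
    if isinstance(result, dict):
        return result
    try:
        parsed = json.loads(result)
    except (TypeError, ValueError):
        parsed = None
    # Plain text or non-object JSON is wrapped so callers always get a dict.
    return parsed if isinstance(parsed, dict) else {"text": result}
```

The resulting `json_template` string can be passed as `parser.parse(user_prompt=..., json_template=json_template)`, and `normalize_result` applied to whatever comes back.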
Version history
| Version | Changes |
|---|---|
| 0.1.0 | Initial release — CSV, DOCX, PDF, Excel, MarkItDown extractors |