Elsai Text Extractors

Package: elsai-text-extractors  v0.1.0

Extracts text content from PDF, DOCX, CSV, Excel, and other document formats. Each extractor takes the file path at construction time and exposes a dedicated extraction method.

For LLM-powered natural language querying over Excel files, see ExcelParser (separate package).

Installation

bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-text-extractors/ elsai-text-extractors==0.1.0

Requirements: Python >= 3.9


Available extractors

| Class | Package | Input format | Method |
| --- | --- | --- | --- |
| CSVFileExtractor | elsai-text-extractors | CSV | load_from_csv() |
| DocxTextExtractor | elsai-text-extractors | DOCX | extract_text_from_docx() |
| PyPDFTextExtractor | elsai-text-extractors | PDF | extract_text_from_pdf() |
| UnstructuredExcelLoaderService | elsai-text-extractors | XLSX | load_excel() |
| MarkItDownTextExtractor | elsai-text-extractors | PDF, DOCX, PPTX, and more | extract_text_from_file() |
| ExcelParser | elsai-parsers | XLSX (LLM-powered) | parse() |
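
Because each extractor has its own class and method name, code that handles several file types needs some way to choose between them. The following is a minimal sketch of one way to do that by file suffix. The module paths, class names, and method names are taken from this page; the names EXTRACTORS, FALLBACK, pick_extractor, and extract are hypothetical helpers, and routing unknown suffixes to MarkItDownTextExtractor is an assumption of this sketch, not documented package behavior.

```python
import importlib
from pathlib import Path

# (module path, class name, method name) per file suffix, copied from this page.
# Mapping suffixes to extractors this way is an assumption of the sketch.
EXTRACTORS = {
    ".csv": ("elsai_text_extractors.csv_extractor", "CSVFileExtractor", "load_from_csv"),
    ".docx": ("elsai_text_extractors.docx_extractor", "DocxTextExtractor", "extract_text_from_docx"),
    ".pdf": ("elsai_text_extractors.pypdfloader", "PyPDFTextExtractor", "extract_text_from_pdf"),
    ".xlsx": (
        "elsai_text_extractors.unstructured_excel_loader",
        "UnstructuredExcelLoaderService",
        "load_excel",
    ),
}

# Used for any other suffix, for example .pptx (assumed fallback).
FALLBACK = ("elsai_text_extractors.markitdown", "MarkItDownTextExtractor", "extract_text_from_file")


def pick_extractor(file_path):
    """Return (module path, class name, method name) for a path, chosen by suffix."""
    suffix = Path(file_path).suffix.lower()
    return EXTRACTORS.get(suffix, FALLBACK)


def extract(file_path):
    """Import the chosen extractor lazily and call its extraction method."""
    module_name, class_name, method_name = pick_extractor(file_path)
    extractor_cls = getattr(importlib.import_module(module_name), class_name)
    extractor = extractor_cls(file_path=str(file_path))
    return getattr(extractor, method_name)()
```

With the package installed, extract("sample_data/sample.pdf") would construct PyPDFTextExtractor and call extract_text_from_pdf(). The lazy import keeps pick_extractor usable even when the package is not available.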

CSVFileExtractor

Loads and reads content from CSV files. The file path is passed at construction time.

python
from elsai_text_extractors.csv_extractor import CSVFileExtractor

extractor = CSVFileExtractor(file_path="sample_data/sample.csv")
content = extractor.load_from_csv()
print(content)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the CSV file |

DocxTextExtractor

Extracts plain text from Microsoft Word (.docx) files.

python
from elsai_text_extractors.docx_extractor import DocxTextExtractor

extractor = DocxTextExtractor(file_path="sample_data/sample.docx")
content = extractor.extract_text_from_docx()
print(content)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the .docx file |

PyPDFTextExtractor

Extracts text from PDF files using PyPDF.

python
from elsai_text_extractors.pypdfloader import PyPDFTextExtractor

extractor = PyPDFTextExtractor(file_path="sample_data/sample.pdf")
content = extractor.extract_text_from_pdf()
print(content)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the PDF file |

UnstructuredExcelLoaderService

Loads content from Excel (.xlsx) files using the Unstructured library.

python
from elsai_text_extractors.unstructured_excel_loader import UnstructuredExcelLoaderService

extractor = UnstructuredExcelLoaderService(file_path="sample_data/sample.xlsx")
content = extractor.load_excel()
print(content)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the .xlsx file |

MarkItDownTextExtractor

Converts a wide range of document formats into clean Markdown text using the MarkItDown library. Useful when you need consistent text output across mixed document types.

Supported formats: PDF, DOCX, PPTX, XLSX, and other formats supported by MarkItDown.

python
from elsai_text_extractors.markitdown import MarkItDownTextExtractor

extractor = MarkItDownTextExtractor(file_path="/path/to/document.pdf")
content = extractor.extract_text_from_file()
print(content)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the document file |
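
For the mixed-document case described above, a small batch helper can apply the same extraction call to every file in a folder. This is a minimal sketch: convert_folder and markitdown_extract are hypothetical names, the extractor is passed in as a plain callable, and the output is assumed to be text (str() is applied in case it is not, since the return type is not stated on this page).

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, extract_text):
    """Run extract_text on every file in src_dir and save each result as <name>.md.

    extract_text is any callable that takes a path string and returns text.
    Returns the list of files written, in sorted order of the source names.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        text = extract_text(str(path))
        target = out / (path.name + ".md")
        # str() guards against a non-string return value (an assumption here)
        target.write_text(str(text), encoding="utf-8")
        written.append(target)
    return written


def markitdown_extract(path):
    # Imported lazily so convert_folder has no hard dependency on the package
    from elsai_text_extractors.markitdown import MarkItDownTextExtractor

    return MarkItDownTextExtractor(file_path=path).extract_text_from_file()
```

With the package installed, convert_folder("docs_in", "docs_out", markitdown_extract) would write one Markdown file per input document. Passing the extractor as a callable also makes it easy to swap in a stub for testing.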

ExcelParser

Package: elsai-parsers  v0.1.0

Queries Excel files using natural language. Unlike the extractors above, which return raw text, ExcelParser indexes the Excel data into a vector database and uses an LLM to answer questions about it. It can also return structured JSON responses.

bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-parsers/ elsai-parsers==0.1.0

python
from elsai_parsers.excel_parser import ExcelParser
from elsai_model.azure_openai import AzureOpenAIConnector
from elsai_embeddings.azure_embeddings import AzureOpenAIEmbeddingModel
from elsai_vectordb.chromadb import ChromaVectorDb

llm = AzureOpenAIConnector(...)
embedding_model = AzureOpenAIEmbeddingModel(...)
vector_db = ChromaVectorDb(persist_directory="./excel_db")

parser = ExcelParser(
    file_path="data/sales_report.xlsx",
    config={
        "vector_database": vector_db,
        "embedding_function": embedding_model,
        "llm": llm,
        "vector_store_type": "chroma",
    },
)

# Natural language query — returns plain text
result = parser.parse(user_prompt="What is the total revenue for Q3?")
print(result)

# With a JSON template — returns structured dict
result = parser.parse(
    user_prompt="List the top 5 products by sales.",
    json_template='{"products": [{"name": "", "sales": 0}]}',
)
print(result)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the Excel (.xlsx) file |
| config | Configuration dictionary (see below) |

config dictionary keys:

| Key | Description |
| --- | --- |
| vector_database | Vector database instance for indexing Excel content (e.g. ChromaVectorDb) |
| embedding_function | Embedding model instance for vectorizing content |
| llm | LLM connector instance for answering queries |
| vector_store_type | String identifier for the vector store type (e.g. "chroma") |
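
Since the config dictionary carries several collaborating objects, it can help to check it before constructing the parser. The sketch below assumes all four keys listed above are required, which this page does not state explicitly. REQUIRED_CONFIG_KEYS and missing_config_keys are hypothetical names, not part of elsai-parsers.

```python
# Keys listed in the config table above; treating all of them as required is an assumption.
REQUIRED_CONFIG_KEYS = ("vector_database", "embedding_function", "llm", "vector_store_type")


def missing_config_keys(config):
    """Return the listed keys that are absent or set to None, in table order."""
    return [key for key in REQUIRED_CONFIG_KEYS if config.get(key) is None]
```

A caller might raise a ValueError when missing_config_keys(config) is non-empty, so that a forgotten key is reported before any indexing work starts.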

parse() parameters:

| Parameter | Description |
| --- | --- |
| user_prompt | Natural language question about the Excel data |
| json_template | Optional JSON string template. When provided, the response is structured to match the template schema |

Returns a plain text string when json_template is omitted, or a structured dict when a template is provided.
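
Because json_template is a JSON string, building it from a Python dict with the standard json module avoids quoting mistakes in hand-written strings. The sketch below also includes a defensive normalizer for the result. make_template and to_dict are hypothetical helpers, and the possibility of the structured result arriving as a JSON string rather than a dict is an assumption made for safety, not documented behavior.

```python
import json


def make_template(shape):
    """Serialize a Python dict describing the desired shape into a JSON string."""
    return json.dumps(shape)


def to_dict(result):
    """Return result as a dict, parsing it first if it arrived as a JSON string."""
    if isinstance(result, dict):
        return result
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on non-JSON text
    parsed = json.loads(result)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

For example, make_template({"products": [{"name": "", "sales": 0}]}) produces the same template string used in the parse() call above, and to_dict can wrap the value returned by parse() before reading keys from it.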


Version history

| Version | Changes |
| --- | --- |
| 0.1.0 | Initial release: CSV, DOCX, PDF, Excel, and MarkItDown extractors |

Copyright © 2026 Elsai Foundry.