elsai Text Extractors

Package: elsai-text-extractors v0.1.0

Extracts text content from PDF, DOCX, CSV, Excel, and other document formats. Each extractor takes the file path at construction time and exposes a dedicated extraction method.

For LLM-powered natural language querying over Excel files, see the ExcelParser section below. ExcelParser ships in a separate package, elsai-parsers.

Installation

bash

pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-text-extractors/ elsai-text-extractors==0.1.0

Requirements: Python >= 3.9
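
Before importing anything, it can help to confirm that the interpreter meets the stated minimum and that the distribution is actually installed. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are illustrative and not part of the package.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 9)  # from the stated requirement: Python >= 3.9


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("elsai-text-extractors:", installed_version("elsai-text-extractors"))
```

If `installed_version` returns `None`, the pip command above has probably not been run in the active environment.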

Available extractors

Class	Package	Input format	Method
`CSVFileExtractor`	`elsai-text-extractors`	CSV	`load_from_csv()`
`DocxTextExtractor`	`elsai-text-extractors`	DOCX	`extract_text_from_docx()`
`PyPDFTextExtractor`	`elsai-text-extractors`	PDF	`extract_text_from_pdf()`
`UnstructuredExcelLoaderService`	`elsai-text-extractors`	XLSX	`load_excel()`
`MarkItDownTextExtractor`	`elsai-text-extractors`	PDF, DOCX, PPTX, XLSX, and more	`extract_text_from_file()`
`ExcelParser`	`elsai-parsers`	XLSX (LLM-powered)	`parse()`
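
The table above can be turned into a small lookup that picks an extractor from a file's suffix. This is only a sketch: the module paths, class names, and method names are taken from the examples on this page, while `ExtractorSpec`, `spec_for`, and `extract` are hypothetical helpers. Routing `.pptx` to MarkItDown and `.pdf`/`.docx` to the dedicated extractors is one possible choice, not a recommendation from the package.

```python
import importlib
from pathlib import Path
from typing import NamedTuple


class ExtractorSpec(NamedTuple):
    module: str      # import path, as used in the examples on this page
    class_name: str  # extractor class
    method: str      # extraction method to call on the instance


# Suffix -> extractor, copied from the table above.
EXTRACTORS = {
    ".csv": ExtractorSpec(
        "elsai_text_extractors.csv_extractor", "CSVFileExtractor", "load_from_csv"),
    ".docx": ExtractorSpec(
        "elsai_text_extractors.docx_extractor", "DocxTextExtractor", "extract_text_from_docx"),
    ".pdf": ExtractorSpec(
        "elsai_text_extractors.pypdfloader", "PyPDFTextExtractor", "extract_text_from_pdf"),
    ".xlsx": ExtractorSpec(
        "elsai_text_extractors.unstructured_excel_loader",
        "UnstructuredExcelLoaderService", "load_excel"),
    ".pptx": ExtractorSpec(
        "elsai_text_extractors.markitdown", "MarkItDownTextExtractor", "extract_text_from_file"),
}


def spec_for(path: str) -> ExtractorSpec:
    """Look up the extractor for a path by its (case-insensitive) suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"No extractor registered for {suffix!r}") from None


def extract(path: str):
    """Import the matching class lazily and call its extraction method."""
    spec = spec_for(path)
    module = importlib.import_module(spec.module)  # needs the package installed
    extractor = getattr(module, spec.class_name)(file_path=path)
    return getattr(extractor, spec.method)()
```

Only `extract` needs the package installed; `spec_for` is a plain dictionary lookup and can be checked on its own.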

CSVFileExtractor

Loads and reads content from CSV files. The file path is passed at construction time.

python

from elsai_text_extractors.csv_extractor import CSVFileExtractor

extractor = CSVFileExtractor(file_path="sample_data/sample.csv")
content = extractor.load_from_csv()
print(content)

Constructor parameters:

Parameter	Description
`file_path`	Path to the CSV file

DocxTextExtractor

Extracts plain text from Microsoft Word (.docx) files.

python

from elsai_text_extractors.docx_extractor import DocxTextExtractor

extractor = DocxTextExtractor(file_path="sample_data/sample.docx")
content = extractor.extract_text_from_docx()
print(content)

Constructor parameters:

Parameter	Description
`file_path`	Path to the `.docx` file
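
Because the path is fixed at construction time, it can be worth validating it before building the extractor. This page does not say how the extractors behave on a missing or mismatched file, so the sketch below checks up front; `checked_path` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def checked_path(path: str, expected_suffix: str) -> str:
    """Return the path as a string after verifying its suffix and existence."""
    p = Path(path)
    if p.suffix.lower() != expected_suffix.lower():
        raise ValueError(f"Expected a {expected_suffix} file, got {p.name!r}")
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p}")
    return str(p)


# Intended use (requires elsai-text-extractors to be installed):
# extractor = DocxTextExtractor(file_path=checked_path("sample_data/sample.docx", ".docx"))
```

The same helper applies to the other extractors by changing the expected suffix.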

PyPDFTextExtractor

Extracts text from PDF files using PyPDF.

python

from elsai_text_extractors.pypdfloader import PyPDFTextExtractor

extractor = PyPDFTextExtractor(file_path="sample_data/sample.pdf")
content = extractor.extract_text_from_pdf()
print(content)

Constructor parameters:

Parameter	Description
`file_path`	Path to the PDF file

UnstructuredExcelLoaderService

Loads content from Excel (.xlsx) files using the Unstructured library.

python

from elsai_text_extractors.unstructured_excel_loader import UnstructuredExcelLoaderService

extractor = UnstructuredExcelLoaderService(file_path="sample_data/sample.xlsx")
content = extractor.load_excel()
print(content)

Constructor parameters:

Parameter	Description
`file_path`	Path to the `.xlsx` file
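
This page does not state what type `load_excel()` returns. If downstream code needs a single string, a defensive normalizer along these lines may help. It is a sketch under assumptions: `to_text` is a hypothetical helper, and the `page_content` attribute is a guess modeled on LangChain-style document objects, not something confirmed here.

```python
from typing import Any


def to_text(content: Any) -> str:
    """Flatten an extractor result into one string, whatever its shape."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, (list, tuple)):
        # Join items one per line, normalizing each recursively.
        return "\n".join(to_text(item) for item in content)
    # ASSUMPTION: document-like objects expose their text as `page_content`.
    page_content = getattr(content, "page_content", None)
    if isinstance(page_content, str):
        return page_content
    return str(content)
```

For example, `to_text(extractor.load_excel())` would yield a string whether the loader returns text, a list of strings, or a list of document-like objects.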

MarkItDownTextExtractor

Converts a wide range of document formats into clean Markdown text using the MarkItDown library. Useful when you need consistent text output across mixed document types.

Supported formats: PDF, DOCX, PPTX, XLSX, and other formats supported by MarkItDown.

python

from elsai_text_extractors.markitdown import MarkItDownTextExtractor

extractor = MarkItDownTextExtractor(file_path="/path/to/document.pdf")
content = extractor.extract_text_from_file()
print(content)

Constructor parameters:

Parameter	Description
`file_path`	Path to the document file
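
Since one extractor instance is bound to one file, processing a mixed set of documents means constructing an extractor per path. The sketch below shows one way to do that while collecting failures instead of stopping at the first one. `extract_many` is a hypothetical helper; it takes the constructor as a parameter so it makes no assumptions about the package beyond the `file_path` argument and method name shown above.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_many(
    paths: Iterable[str],
    make_extractor: Callable[[str], Any],
    method: str = "extract_text_from_file",
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run one extractor per path; return (results, errors) keyed by path."""
    texts: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            extractor = make_extractor(path)
            texts[path] = getattr(extractor, method)()
        except Exception as exc:  # record the failure and keep going
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors


# Intended use (requires elsai-text-extractors to be installed):
# texts, errors = extract_many(
#     ["a.pdf", "b.docx", "c.pptx"],
#     lambda p: MarkItDownTextExtractor(file_path=p),
# )
```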

ExcelParser

Package: elsai-parsers v0.1.0

Queries Excel files using natural language. Unlike the extractors above, which return raw text, ExcelParser indexes the Excel data into a vector database and uses an LLM to answer questions about it. It can also return structured JSON responses.

bash

pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-parsers/ elsai-parsers==0.1.0

python

from elsai_parsers.excel_parser import ExcelParser
from elsai_model.azure_openai import AzureOpenAIConnector
from elsai_embeddings.azure_embeddings import AzureOpenAIEmbeddingModel
from elsai_vectordb.chromadb import ChromaVectorDb

llm = AzureOpenAIConnector(...)
embedding_model = AzureOpenAIEmbeddingModel(...)
vector_db = ChromaVectorDb(persist_directory="./excel_db")

parser = ExcelParser(
    file_path="data/sales_report.xlsx",
    config={
        "vector_database": vector_db,
        "embedding_function": embedding_model,
        "llm": llm,
        "vector_store_type": "chroma",
    },
)

# Natural language query — returns plain text
result = parser.parse(user_prompt="What is the total revenue for Q3?")
print(result)

# With a JSON template — returns structured dict
result = parser.parse(
    user_prompt="List the top 5 products by sales.",
    json_template='{"products": [{"name": "", "sales": 0}]}',
)
print(result)

Constructor parameters:

Parameter	Description
`file_path`	Path to the Excel (`.xlsx`) file
`config`	Configuration dictionary (see below)

config dictionary keys:

Key	Description
`vector_database`	Vector database instance for indexing Excel content (e.g. `ChromaVectorDb`)
`embedding_function`	Embedding model instance for vectorizing content
`llm`	LLM connector instance for answering queries
`vector_store_type`	String identifier for the vector store type (e.g. `"chroma"`)
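
A misspelled or missing key in `config` is easy to overlook, so a small check before constructing the parser can catch it early. This sketch assumes all four keys listed above are required, which this page does not state explicitly; `check_parser_config` is a hypothetical helper.

```python
# Keys taken from the table above. ASSUMPTION: all four are required.
REQUIRED_CONFIG_KEYS = (
    "vector_database",
    "embedding_function",
    "llm",
    "vector_store_type",
)


def check_parser_config(config: dict) -> dict:
    """Raise early if a documented key is missing; return config unchanged."""
    missing = [key for key in REQUIRED_CONFIG_KEYS if config.get(key) is None]
    if missing:
        raise ValueError("ExcelParser config is missing: " + ", ".join(missing))
    if not isinstance(config["vector_store_type"], str):
        raise TypeError("vector_store_type should be a string such as 'chroma'")
    return config


# Intended use (requires elsai-parsers to be installed):
# parser = ExcelParser(file_path="data/sales_report.xlsx",
#                      config=check_parser_config(config))
```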

parse() parameters:

Parameter	Description
`user_prompt`	Natural language question about the Excel data
`json_template`	Optional JSON string template — when provided, the response is structured to match the template schema

Returns a plain text string when json_template is omitted, or a structured dict when a template is provided.
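
Because `json_template` is a JSON string, building it from a Python dict with `json.dumps` avoids quoting mistakes in hand-written templates. The sketch below also includes a defensive converter for the result. Both helper names are hypothetical, and accepting a JSON string in `as_dict` is an extra precaution, not documented behavior.

```python
import json
from typing import Union


def make_json_template(shape: dict) -> str:
    """Serialize a Python dict into a JSON template string."""
    return json.dumps(shape)


def as_dict(result: Union[str, dict]) -> dict:
    """Return the result as a dict, parsing it if it arrived as JSON text."""
    if isinstance(result, dict):
        return result
    parsed = json.loads(result)  # raises json.JSONDecodeError on plain text
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed


template = make_json_template({"products": [{"name": "", "sales": 0}]})

# Intended use (requires a configured ExcelParser instance):
# result = parser.parse(user_prompt="List the top 5 products by sales.",
#                       json_template=template)
# products = as_dict(result)["products"]
```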

Version history

Version	Changes
0.1.0	Initial release — CSV, DOCX, PDF, Excel, MarkItDown extractors
