Elsai Text Extractors
The Elsai Text Extractors package provides utilities for extracting structured or unstructured text from several document formats. It supports:
- CSV files
- DOCX files
- PDF documents
- Excel spreadsheets
- Other formats, such as PowerPoint presentations, through MarkItDown (version 0.2.0 and later)
Prerequisites
- Python >= 3.9 
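Installation
To install the elsai-text-extractors package, run the command below. It pins version 0.1.0; the multi-page PDF fix requires 0.1.1 and MarkItDownTextExtractor requires 0.2.0, so change the version pin accordingly if you need those features.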
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-text-extractors/ elsai-text-extractors==0.1.0
Components
1. CSVFileExtractor
CSVFileExtractor reads and parses CSV files from the provided file path.
from elsai_text_extractors.csv_extractor import CSVFileExtractor
extractor = CSVFileExtractor(file_path="sample_data/sample.csv")
content = extractor.load_from_csv()
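To try the snippet above without an existing data file, a small sample CSV can be generated with the standard library. This is only a sketch: the helper name write_sample_csv and the column names are made up for illustration, and the extractor call is left as a comment because it needs the package installed.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny CSV file (hypothetical columns) and return its path."""
    path = directory / "sample.csv"
    rows = [
        ["id", "name", "city"],
        ["1", "Ada", "London"],
        ["2", "Linus", "Helsinki"],
    ]
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return path


if __name__ == "__main__":
    sample_path = write_sample_csv(Path(tempfile.mkdtemp()))
    print(f"Sample CSV written to {sample_path}")
    # With the package installed, the file can be passed to the extractor:
    # extractor = CSVFileExtractor(file_path=str(sample_path))
    # content = extractor.load_from_csv()
```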
2. DocxTextExtractor
DocxTextExtractor extracts plain text from Microsoft Word (.docx) documents.
from elsai_text_extractors.docx_extractor import DocxTextExtractor
extractor = DocxTextExtractor(file_path="sample_data/sample.docx")
content = extractor.extract_text_from_docx()
3. PyPDFTextExtractor
PyPDFTextExtractor extracts text from PDF documents using the PyPDF library.
Note
Important: In version 0.1.0, the PyPDF extractor extracted only the first page of a PDF. Version 0.1.1 fixes this and extracts all pages.
from elsai_text_extractors.pypdfloader import PyPDFTextExtractor
extractor = PyPDFTextExtractor(file_path="sample_data/sample.pdf")
content = extractor.extract_text_from_pdf()
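Because the behaviour differs between 0.1.0 and 0.1.1, it can help to check the installed version before relying on multi-page extraction. The sketch below uses the standard importlib.metadata module. It assumes the distribution name matches the pip command (elsai-text-extractors) and that versions are plain dotted numbers such as 0.1.1; the helper names are illustrative, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package: str = "elsai-text-extractors"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("elsai-text-extractors is not installed")
    elif meets_minimum(found, "0.1.1"):
        print(f"Version {found}: all PDF pages are extracted")
    else:
        print(f"Version {found}: only the first PDF page is extracted")
```

The same check with a minimum of 0.2.0 indicates whether the MarkItDown-based extractor is available. Pre-release or otherwise non-numeric version strings would need a fuller parser than this one.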
4. UnstructuredExcelLoaderService
UnstructuredExcelLoaderService loads and parses Excel spreadsheets (.xlsx), handling unstructured formats gracefully.
from elsai_text_extractors.unstructured_excel_loader import UnstructuredExcelLoaderService
extractor = UnstructuredExcelLoaderService(file_path="sample_data/sample.xlsx")
content = extractor.load_excel()
5. MarkItDownTextExtractor
MarkItDownTextExtractor uses Microsoft’s MarkItDown library to convert documents into structured Markdown text. It supports a range of file types, including PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets, and preserves document formatting and structure in the output.
Note
Version Requirement: MarkItDownTextExtractor is available only in elsai-text-extractors version 0.2.0 and later.
from elsai_text_extractors.markitdown import MarkItDownTextExtractor
extractor = MarkItDownTextExtractor(file_path="/path/to/document.pdf")
content = extractor.extract_text_from_file()
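When handling mixed input, the format-specific extractors can be selected by file extension. The following is a minimal sketch, not part of the package: the module paths, class names, and method names are copied from the usage examples in this document, while EXTRACTORS, resolve_extractor, and extract are hypothetical helpers. The package is imported lazily, so the lookup itself works without it installed.

```python
import importlib
from pathlib import Path

# Extension -> (module, class, method), copied from the usage examples.
EXTRACTORS = {
    ".csv": ("elsai_text_extractors.csv_extractor",
             "CSVFileExtractor", "load_from_csv"),
    ".docx": ("elsai_text_extractors.docx_extractor",
              "DocxTextExtractor", "extract_text_from_docx"),
    ".pdf": ("elsai_text_extractors.pypdfloader",
             "PyPDFTextExtractor", "extract_text_from_pdf"),
    ".xlsx": ("elsai_text_extractors.unstructured_excel_loader",
              "UnstructuredExcelLoaderService", "load_excel"),
}


def resolve_extractor(file_path: str):
    """Look up the (module, class, method) triple for a file's extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"No extractor registered for {suffix!r}") from None


def extract(file_path: str):
    """Instantiate the matching extractor and call its extraction method."""
    module_name, class_name, method_name = resolve_extractor(file_path)
    module = importlib.import_module(module_name)  # needs the package installed
    extractor = getattr(module, class_name)(file_path=file_path)
    return getattr(extractor, method_name)()
```

With the package installed, extract("sample_data/sample.pdf") would run the same steps as the PyPDFTextExtractor example. MarkItDownTextExtractor is left out of the table because it requires version 0.2.0 or later.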
