Elsai Text Extractors

The Elsai Text Extractors package provides utilities to extract structured or unstructured text data from various document formats. It supports:

  • CSV files

  • DOCX files

  • PDF documents

  • Excel spreadsheets

Prerequisites

  • Python >= 3.9

Installation

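To install the elsai-text-extractors package, run the command below. It pins version 0.1.0; change the pin to 0.1.1 or later for multi-page PDF extraction, or to 0.2.0 or later for MarkItDownTextExtractor: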

pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-text-extractors/ elsai-text-extractors==0.1.0
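Because some features on this page depend on the installed version, it can help to check the version before relying on them. The following is a minimal sketch using only the standard library. The helper names (`parse_version`, `installed_version`, `supports`) and the feature keys are hypothetical, the minimum versions are copied from the notes on this page, and the parser assumes plain numeric versions such as `0.1.1`.

```python
from importlib import metadata

# Minimum versions copied from the notes in this document.
# The feature keys themselves are illustrative names, not part of the package.
FEATURE_MIN_VERSIONS = {
    "pypdf_all_pages": "0.1.1",
    "markitdown": "0.2.0",
}


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def installed_version(dist_name="elsai-text-extractors"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def supports(feature, version):
    """True if `version` is at least the minimum recorded for `feature`."""
    return parse_version(version) >= parse_version(FEATURE_MIN_VERSIONS[feature])


current = installed_version()
if current is None:
    print("elsai-text-extractors is not installed")
else:
    for feature in FEATURE_MIN_VERSIONS:
        print(feature, "available:", supports(feature, current))
```

Comparing tuples of integers avoids the string-comparison trap where `"0.10.0"` would sort before `"0.2.0"`.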

Components

1. CSVFileExtractor

CSVFileExtractor reads and parses CSV files from the provided file path.

from elsai_text_extractors.csv_extractor import CSVFileExtractor

extractor = CSVFileExtractor(file_path="sample_data/sample.csv")
content = extractor.load_from_csv()
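To try the extractor without an existing `sample_data/sample.csv`, a small test file can be generated first. This is a sketch under stated assumptions: `write_sample_csv` and the sample rows are hypothetical, and because this page does not document the return type of `load_from_csv()`, the example only prints its type. The import is guarded so the snippet still runs when the package is not installed.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(path):
    """Write a tiny two-row CSV so the extractor has something to read."""
    rows = [
        {"name": "Ada", "role": "engineer"},
        {"name": "Grace", "role": "admiral"},
    ]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "role"])
        writer.writeheader()
        writer.writerows(rows)
    return path


# Write the sample into a fresh temporary directory.
sample_path = write_sample_csv(Path(tempfile.mkdtemp()) / "sample.csv")

try:
    from elsai_text_extractors.csv_extractor import CSVFileExtractor
except ImportError:
    # Package not installed: the sample file is still ready for later use.
    print(f"Sample written to {sample_path}; install the package to load it.")
else:
    content = CSVFileExtractor(file_path=str(sample_path)).load_from_csv()
    # The return type is not documented on this page, so just inspect it.
    print(type(content))
```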

2. DocxTextExtractor

DocxTextExtractor extracts plain text from Microsoft Word (.docx) documents.

from elsai_text_extractors.docx_extractor import DocxTextExtractor

extractor = DocxTextExtractor(file_path="sample_data/sample.docx")
content = extractor.extract_text_from_docx()

3. PyPDFTextExtractor

PyPDFTextExtractor extracts text from PDF documents using the PyPDF library.

Note

Important: In version 0.1.0, PyPDFTextExtractor extracted only the first page of a PDF. Version 0.1.1 fixes this and extracts all pages.

from elsai_text_extractors.pypdfloader import PyPDFTextExtractor

extractor = PyPDFTextExtractor(file_path="sample_data/sample.pdf")
content = extractor.extract_text_from_pdf()

4. UnstructuredExcelLoaderService

UnstructuredExcelLoaderService loads and parses Excel spreadsheets (.xlsx), handling unstructured formats gracefully.

from elsai_text_extractors.unstructured_excel_loader import UnstructuredExcelLoaderService

extractor = UnstructuredExcelLoaderService(file_path="sample_data/sample.xlsx")
content = extractor.load_excel()

5. MarkItDownTextExtractor

MarkItDownTextExtractor uses Microsoft’s MarkItDown library to convert documents into structured Markdown text. It supports PDFs, Word documents, PowerPoint presentations, Excel spreadsheets, and other file types, and preserves formatting and structure in the output.

Note

Version Requirement: MarkItDownTextExtractor is available only in elsai-text-extractors version 0.2.0 and later.

from elsai_text_extractors.markitdown import MarkItDownTextExtractor

extractor = MarkItDownTextExtractor(file_path="/path/to/document.pdf")
content = extractor.extract_text_from_file()
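The five components share the same pattern: construct the class with `file_path`, then call one extraction method. The sketch below shows one possible way to choose a component by file extension. The module, class, and method names are copied from the examples on this page; `EXTRACTORS`, `MARKITDOWN`, `pick_extractor`, and `extract` are hypothetical helpers, and falling back to MarkItDownTextExtractor for other extensions is an assumption that requires version 0.2.0 or later.

```python
import importlib
from pathlib import Path

# suffix -> (module, class, method), copied from the examples on this page.
EXTRACTORS = {
    ".csv": ("elsai_text_extractors.csv_extractor",
             "CSVFileExtractor", "load_from_csv"),
    ".docx": ("elsai_text_extractors.docx_extractor",
              "DocxTextExtractor", "extract_text_from_docx"),
    ".pdf": ("elsai_text_extractors.pypdfloader",
             "PyPDFTextExtractor", "extract_text_from_pdf"),
    ".xlsx": ("elsai_text_extractors.unstructured_excel_loader",
              "UnstructuredExcelLoaderService", "load_excel"),
}

# Assumed fallback for any other extension (needs package version 0.2.0+).
MARKITDOWN = ("elsai_text_extractors.markitdown",
              "MarkItDownTextExtractor", "extract_text_from_file")


def pick_extractor(file_path):
    """Return the (module, class, method) triple for a path's extension."""
    suffix = Path(file_path).suffix.lower()
    return EXTRACTORS.get(suffix, MARKITDOWN)


def extract(file_path):
    """Import the chosen component lazily and run its extraction method."""
    module_name, class_name, method_name = pick_extractor(file_path)
    module = importlib.import_module(module_name)
    extractor = getattr(module, class_name)(file_path=file_path)
    return getattr(extractor, method_name)()
```

With the package installed, `extract("sample_data/sample.pdf")` would route to PyPDFTextExtractor, while a path such as `slides.pptx` would fall through to MarkItDownTextExtractor. Importing lazily means only the module for the chosen component is loaded.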