Elsai Text Extractors#
The Elsai Text Extractors package provides utilities to extract structured or unstructured text data from various document formats. It supports:
CSV files
DOCX files
PDF documents
Excel spreadsheets
Prerequisites#
Python >= 3.9
Installation#
To install the elsai-text-extractors package:
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-text-extractors/ elsai-text-extractors==0.1.0
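After installing, it can be useful to confirm that the interpreter meets the prerequisite and that the package is visible to it. The following is a minimal sketch using only the standard library; the helper names `python_ok` and `installed_version` are illustrative and not part of the package.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 9)  # minimum version listed under Prerequisites
PACKAGE = "elsai-text-extractors"  # distribution name used in the pip command


def python_ok(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = PACKAGE) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Installed version:", installed_version())
```

If `installed_version()` returns `None`, the package is not installed in the environment that is running the script, which often points to a virtual environment mismatch.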
Components#
1. CSVFileExtractor#
CSVFileExtractor reads and parses CSV files from the provided file path.
from elsai_text_extractors.csv_extractor import CSVFileExtractor
extractor = CSVFileExtractor(file_path="sample_data/sample.csv")
content = extractor.load_from_csv()
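The path `sample_data/sample.csv` above is a placeholder. To try the extractor on a known input, a small file can be generated first. This is a sketch under assumptions: `write_sample_csv` and `load_csv` are hypothetical helpers, the sample rows are made up, and the return type of `load_from_csv()` is not documented here, so the code only passes it through.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(path: Path) -> Path:
    """Create a tiny CSV file so the extractor has something to read."""
    path.parent.mkdir(parents=True, exist_ok=True)
    rows = [["name", "role"], ["Ada", "engineer"], ["Grace", "admiral"]]
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return path


def load_csv(path: Path):
    """Run CSVFileExtractor on `path`; return None if the package is missing."""
    try:
        from elsai_text_extractors.csv_extractor import CSVFileExtractor
    except ImportError:
        return None
    return CSVFileExtractor(file_path=str(path)).load_from_csv()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = write_sample_csv(Path(tmp) / "sample_data" / "sample.csv")
        print(load_csv(sample))
```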
2. DocxTextExtractor#
DocxTextExtractor extracts plain text from Microsoft Word (.docx) documents.
from elsai_text_extractors.docx_extractor import DocxTextExtractor
extractor = DocxTextExtractor(file_path="sample_data/sample.docx")
content = extractor.extract_text_from_docx()
3. PyPDFTextExtractor#
PyPDFTextExtractor extracts text from PDF documents using the PyPDF library.
Note
In version 0.1.0, PyPDFTextExtractor extracted only the first page of a PDF. Version 0.1.1 fixes this and extracts all pages. The installation command above pins version 0.1.0, so install 0.1.1 if you need multi-page extraction.
from elsai_text_extractors.pypdfloader import PyPDFTextExtractor
extractor = PyPDFTextExtractor(file_path="sample_data/sample.pdf")
content = extractor.extract_text_from_pdf()
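Because the first-page-only behaviour depends on the installed release, code that relies on multi-page extraction may want to check the version first. The sketch below compares version strings against 0.1.1, the release named in the note above. It assumes plain numeric versions of the form `X.Y.Z` (a suffix such as `rc1` would raise `ValueError`), and the helper names are illustrative.

```python
from importlib import metadata

FIXED_IN = (0, 1, 1)  # first release that extracts every page, per the note above


def parse_version(text: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def extracts_all_pages(version_text: str) -> bool:
    """Return True if this release includes the multi-page fix."""
    return parse_version(version_text) >= FIXED_IN


def check_installed(dist_name: str = "elsai-text-extractors"):
    """Return True/False for the installed release, or None if not installed."""
    try:
        version_text = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return extracts_all_pages(version_text)
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where `"0.10.0"` sorts before `"0.9.0"`.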
4. UnstructuredExcelLoaderService#
UnstructuredExcelLoaderService loads and parses Excel spreadsheets (.xlsx), handling unstructured formats gracefully.
from elsai_text_extractors.unstructured_excel_loader import UnstructuredExcelLoaderService
extractor = UnstructuredExcelLoaderService(file_path="sample_data/sample.xlsx")
content = extractor.load_excel()
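The four components follow the same pattern: construct the class with `file_path`, then call one loading method. One way to use them together is to pick the extractor from the file extension. The sketch below is not part of the package; `EXTRACTORS`, `resolve`, and `extract` are hypothetical names, and the module, class, and method names in the table are copied from the examples in this section.

```python
import importlib
from pathlib import Path

# suffix -> (module, class, method), taken from the usage examples above
EXTRACTORS = {
    ".csv": ("elsai_text_extractors.csv_extractor",
             "CSVFileExtractor", "load_from_csv"),
    ".docx": ("elsai_text_extractors.docx_extractor",
              "DocxTextExtractor", "extract_text_from_docx"),
    ".pdf": ("elsai_text_extractors.pypdfloader",
             "PyPDFTextExtractor", "extract_text_from_pdf"),
    ".xlsx": ("elsai_text_extractors.unstructured_excel_loader",
              "UnstructuredExcelLoaderService", "load_excel"),
}


def resolve(file_path: str):
    """Look up the (module, class, method) triple for a file's extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return EXTRACTORS[suffix]


def extract(file_path: str):
    """Import the matching extractor lazily and return whatever it produces."""
    module_name, class_name, method_name = resolve(file_path)
    # Raises ImportError if elsai-text-extractors is not installed.
    module = importlib.import_module(module_name)
    extractor = getattr(module, class_name)(file_path=file_path)
    return getattr(extractor, method_name)()
```

With the package installed, `extract("sample_data/sample.pdf")` would behave like the PyPDFTextExtractor example above. Importing lazily keeps `resolve` usable even where the package is absent.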