Elsai OCR Extractors
Package: elsai-ocr-extractors v2.0.1
A unified collection of OCR integrations for extracting text, tables, and structured data from scanned documents and images. Each extractor wraps a different OCR backend — pick the one that fits your document type, cloud provider, and accuracy requirements.
Installation
```bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-ocr-extractors/ elsai-ocr-extractors==2.0.1
```

Requirements: Python >= 3.9
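After installing, it can be useful to confirm that the pinned release is the one actually present in the environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `python_is_supported` are illustrative and not part of the package.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "elsai-ocr-extractors") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum: tuple = (3, 9)) -> bool:
    """Check the running interpreter against the documented Python >= 3.9 floor."""
    return sys.version_info >= minimum
```

If `installed_version()` returns something other than `"2.0.1"`, the examples on this page may not match the installed release.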
Available extractors
| Extractor | Class | Best for |
|---|---|---|
| Azure Cognitive Services | AzureCognitiveService | General-purpose PDF text extraction |
| Azure Document Intelligence | AzureDocumentIntelligence | Structured documents — tables and key-value fields |
| LlamaParse | LlamaParseExtractor | CSV data parsing via LlamaParse API |
| VisionAI | VisionAIExtractor | PDFs and images via GPT-4o vision (sync + async) |
| Amazon Boto3 | AmazonBoto3Connector | AWS Textract via boto3 — S3 and local files |
| Amazon Textractor | AmazonTextractor | AWS Textract via textractor library |
| Mistral OCR | MistralOCR | Complex layouts, annotations, document Q&A |
Supported file formats
| Extractor | PDF | PNG | JPG/JPEG | TIFF | BMP | GIF | WEBP | CSV |
|---|---|---|---|---|---|---|---|---|
| AzureCognitiveService | ✅ | — | — | — | — | — | — | — |
| AzureDocumentIntelligence | ✅ | ✅ | ✅ | ✅ | — | — | — | — |
| LlamaParseExtractor | — | — | — | — | — | — | — | ✅ |
| VisionAIExtractor | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — |
| AmazonBoto3Connector | ✅ | ✅ | ✅ | ✅ | — | — | — | — |
| AmazonTextractor | ✅ | ✅ | ✅ | ✅ | — | — | — | — |
| MistralOCR | ✅ | — | — | — | — | — | — | — |
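The support matrix can also be expressed as a lookup keyed by file extension. The sketch below is illustrative and not part of the package: `SUPPORTED_FORMATS` and `extractors_for` are hypothetical names, and the extension sets are transcribed by hand from the formats each extractor's section on this page lists.

```python
from pathlib import Path

# Hand-transcribed from this page; update it if the documented formats change.
SUPPORTED_FORMATS = {
    "AzureCognitiveService": {".pdf"},
    "AzureDocumentIntelligence": {".pdf", ".png", ".jpg", ".jpeg", ".tiff"},
    "LlamaParseExtractor": {".csv"},
    "VisionAIExtractor": {
        ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp",
    },
    "AmazonBoto3Connector": {".pdf", ".png", ".jpg", ".jpeg", ".tiff"},
    "AmazonTextractor": {".pdf", ".png", ".jpg", ".jpeg", ".tiff"},
    "MistralOCR": {".pdf"},
}


def extractors_for(file_path: str) -> list[str]:
    """Return the class names whose documented formats include this extension."""
    suffix = Path(file_path).suffix.lower()
    return sorted(
        name for name, extensions in SUPPORTED_FORMATS.items() if suffix in extensions
    )
```

For example, `extractors_for("report.csv")` yields only `LlamaParseExtractor`. Note that the sketch matches `.tiff` but not the shorter `.tif` spelling, since this page does not say how the extractors treat it.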
AzureCognitiveService
Extracts plain text from PDFs using Azure Cognitive Services (Computer Vision Read API). Best for scanned documents and printed text where you only need the raw text content.
Required credentials: AZURE_SUBSCRIPTION_KEY, Azure endpoint URL
```python
from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

extractor = AzureCognitiveService(
    file_path="path/to/your/document.pdf",
    subscription_key="your-subscription-key",
    endpoint="https://your-resource.cognitiveservices.azure.com/",
)
extracted_text = extractor.extract_text_from_pdf()
print(extracted_text)
```

Parameters:
| Parameter | Description |
|---|---|
| file_path | Path to the PDF file to extract text from |
| subscription_key | Azure Cognitive Services subscription key |
| endpoint | Azure Cognitive Services endpoint URL |
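Rather than hard-coding the key, the constructor arguments can be assembled from environment variables. This is a minimal sketch under stated assumptions: `AZURE_SUBSCRIPTION_KEY` is the name given under "Required credentials", while `AZURE_ENDPOINT` and the helper `azure_read_kwargs` are hypothetical names chosen for illustration.

```python
import os
from typing import Mapping, Optional


def azure_read_kwargs(file_path: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for AzureCognitiveService from environment variables."""
    env = os.environ if env is None else env
    # AZURE_ENDPOINT is an assumed variable name, not one defined by the package.
    required = ("AZURE_SUBSCRIPTION_KEY", "AZURE_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "file_path": file_path,
        "subscription_key": env["AZURE_SUBSCRIPTION_KEY"],
        "endpoint": env["AZURE_ENDPOINT"],
    }
```

The result can then be unpacked into the constructor, for example `AzureCognitiveService(**azure_read_kwargs("path/to/your/document.pdf"))`.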
AzureDocumentIntelligence
Extracts both text and structured data (tables, key-value pairs) from documents using Azure Document Intelligence (formerly Form Recognizer). Use this when you need more than raw text — invoices, forms, receipts, or any document with a defined structure.
Supported formats: PDF, PNG, JPG, JPEG, TIFF
Required credentials: Azure Document Intelligence endpoint and key
```python
from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

extractor = AzureDocumentIntelligence(
    file_path="path/to/your/document.pdf",
    vision_endpoint="https://your-resource.cognitiveservices.azure.com/",
    vision_key="your-vision-key",
)

# Extract plain text
extracted_text = extractor.extract_text()
print(extracted_text)

# Extract tables as structured data
extracted_tables = extractor.extract_tables()
print(extracted_tables)
```

Parameters:
| Parameter | Description |
|---|---|
| file_path | Path to the document file |
| vision_endpoint | Azure Document Intelligence endpoint URL |
| vision_key | Azure Document Intelligence API key |
TIP
Use extract_tables() for documents like invoices, spreadsheets, or financial reports where tabular structure matters. Use extract_text() for contracts, reports, or any free-form text document.
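When a pipeline handles both kinds of document, the choice in the tip can be wrapped in one small function. The sketch below is illustrative: `extract_document` is a hypothetical helper, it relies only on the `extract_text()` and `extract_tables()` methods shown above, and it passes their return values through unchanged because this page does not specify their types.

```python
from typing import Any, Dict


def extract_document(extractor: Any, include_tables: bool = True) -> Dict[str, Any]:
    """Collect text, and optionally tables, from one extractor instance."""
    result: Dict[str, Any] = {"text": extractor.extract_text()}
    if include_tables:
        # Skipped for free-form documents where tabular structure is irrelevant.
        result["tables"] = extractor.extract_tables()
    return result
```

Passing `include_tables=False` avoids the second call for contracts or reports where only the text is needed.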
LlamaParseExtractor
Parses CSV files using the LlamaParse API. Unlike the other extractors, this is not an image/PDF OCR tool — it is designed specifically for loading and parsing structured CSV data into a format your agents can reason over.
Supported formats: CSV
Required credentials: LlamaParse API key (from LlamaCloud)
```python
from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor

extractor = LlamaParseExtractor(api_key="your-llama-cloud-api-key")
loaded_data = extractor.load_csv("path/to/your/file.csv")
print(loaded_data)
```

Parameters:
| Parameter | Description |
|---|---|
| api_key | LlamaParse API key |
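To load several CSV files with one extractor instance, a loop over a folder is usually enough. This is a rough sketch, not part of the package: `load_csv_folder` is a hypothetical helper that only assumes the `load_csv(path)` method shown above.

```python
from pathlib import Path
from typing import Any, Dict


def load_csv_folder(extractor: Any, folder: str) -> Dict[str, Any]:
    """Call extractor.load_csv() on every .csv file directly inside folder."""
    results: Dict[str, Any] = {}
    # glob("*.csv") does not recurse into subfolders.
    for csv_path in sorted(Path(folder).glob("*.csv")):
        results[csv_path.name] = extractor.load_csv(str(csv_path))
    return results
```

Each file triggers its own `load_csv()` call, so API usage grows with the number of files in the folder.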
VisionAIExtractor
Extracts text from PDFs and images by sending each page through a GPT-4o-class vision model. Supports both synchronous extraction and asynchronous batch processing for large documents.
Supported formats: PDF, PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP
Required credentials: OpenAI API key
```python
from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

extractor = VisionAIExtractor(
    api_key="your-openai-api-key",
    model_name="gpt-4o-mini",  # or "gpt-4o" for higher accuracy
)

# Extract from a PDF
response = extractor.extract_text_from_file(file_path="/path/to/document.pdf")
print(response[0].page_content)

# Extract from an image
response = extractor.extract_text_from_file(file_path="/path/to/image.png")
print(response[0].page_content)
```

Async extraction with configurable batch size:
For large documents, use the async method to process pages in parallel batches:
```python
import asyncio

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor


async def extract_large_document():
    extractor = VisionAIExtractor(
        api_key="your-openai-api-key",
        model_name="gpt-4o-mini",
    )
    response = await extractor.extract_text_from_file_async(
        file_path="/path/to/large-document.pdf",
        batch_size=10,  # process 10 pages at a time
    )
    for page in response:
        print(page.page_content)


asyncio.run(extract_large_document())
```

Parameters:
| Parameter | Description |
|---|---|
| api_key | OpenAI API key |
| model_name | Vision model to use (e.g. "gpt-4o-mini", "gpt-4o") |
Return value: A list of document objects, each with a page_content attribute containing the extracted text for that page.
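Because the result is one object per page, downstream code often needs the pages merged into a single string. The following is a minimal sketch; `join_pages` is a hypothetical helper that relies only on the `page_content` attribute described above.

```python
from typing import Any, Iterable


def join_pages(pages: Iterable[Any], separator: str = "\n\n") -> str:
    """Concatenate page_content from each page object, skipping empty pages."""
    texts = [page.page_content.strip() for page in pages if page.page_content]
    # Drop pages that contained only whitespace before joining.
    return separator.join(text for text in texts if text)
```

For example, `join_pages(response)` turns the list returned by `extract_text_from_file()` into one block of text with a blank line between pages.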
AmazonBoto3Connector
Processes documents using AWS Textract via the boto3 SDK. Supports both synchronous (single-page, inline) and asynchronous (multi-page, S3-backed) processing. The asynchronous path is required for multi-page PDFs on Textract.
Supported formats: PDF, PNG, JPG, JPEG, TIFF
Required credentials: AWS access key, secret key, session token, region
```python
from elsai_ocr_extractors.amazon_boto3 import AmazonBoto3Connector

client = AmazonBoto3Connector(
    access_key="your-aws-access-key",
    secret_key="your-aws-secret-key",
    session_token="your-session-token",
    region_name="us-east-1",
)

# Synchronous: for S3-hosted single-page documents
resp = client.sync_process_document(
    file_source="s3://your-bucket/folder/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
)
print(resp)

# Asynchronous: for multi-page documents (uploads to S3 first)
resp = client.async_process_document(
    file_source="/path/to/local/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="your-bucket",
    s3_folder="uploads",
)
print(resp)
```

Parameters:
| Parameter | Description |
|---|---|
| access_key | AWS access key ID |
| secret_key | AWS secret access key |
| session_token | AWS session token (for temporary credentials) |
| region_name | AWS region where Textract is called (e.g. "us-east-1") |
feature_list options:
| Feature | What it extracts |
|---|---|
| "text" | Raw text lines and words |
| "tables" | Tabular data with rows and cells |
| "forms" | Key-value pairs from form fields |
| "layout" | Document layout and reading order |
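Since `feature_list` is a list of plain strings, a typo may only surface once the request is sent. The sketch below checks a list against the four documented options before any call is made. It is illustrative only: `VALID_FEATURES` and `normalize_features` are hypothetical names, and how the connector itself validates this argument is not described on this page.

```python
from typing import Iterable, List

# The four options documented in the table above.
VALID_FEATURES = ("text", "tables", "forms", "layout")


def normalize_features(features: Iterable[str]) -> List[str]:
    """Lower-case, de-duplicate, and validate a feature_list, preserving order."""
    cleaned: List[str] = []
    for feature in features:
        name = feature.strip().lower()
        if name not in VALID_FEATURES:
            raise ValueError(
                f"Unknown feature {feature!r}; expected one of {VALID_FEATURES}"
            )
        if name not in cleaned:
            cleaned.append(name)
    return cleaned
```

A call such as `normalize_features(["Text", "tables", "text"])` returns `["text", "tables"]`, while an unrecognized name raises `ValueError` locally.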
Sync vs async
Use sync_process_document() for single-page images or S3-hosted files. Use async_process_document() for local multi-page PDFs — the connector uploads the file to S3 automatically and polls for the result.
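The rule of thumb above can be written as a small dispatcher. This is a sketch under stated assumptions: `choose_textract_method` is a hypothetical helper, `page_count` is a value the caller must supply (the connector is not shown to report it), and S3-hosted sources are routed to the synchronous call because that is what the note above states.

```python
def choose_textract_method(file_source: str, page_count: int = 1) -> str:
    """Return the name of the connector method suggested by the sync/async note."""
    is_s3 = file_source.lower().startswith("s3://")
    # Only local multi-page documents are sent down the asynchronous path.
    if not is_s3 and page_count > 1:
        return "async_process_document"
    return "sync_process_document"
```

The returned name can be resolved with `getattr(client, choose_textract_method(path, pages))`; keep in mind that the asynchronous call also needs `s3_bucket` and `s3_folder`.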
AmazonTextractor
Processes documents using AWS Textract via the amazon-textract-textractor library, which provides richer structured output than the raw boto3 response. Use this when you want higher-level document objects (blocks, tables, queries) rather than raw API responses.
Supported formats: PDF, PNG, JPG, JPEG, TIFF
Required credentials: AWS access key, secret key, session token, region
```python
from elsai_ocr_extractors.amazon_textractor import AmazonTextractor

client = AmazonTextractor(
    access_key="your-aws-access-key",
    secret_key="your-aws-secret-key",
    session_token="your-session-token",
    region_name="us-east-1",
)

# Synchronous: S3-hosted documents
resp = client.sync_process_document(
    file_source="s3://your-bucket/folder/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
)
print(resp)
```

Parameters:
| Parameter | Description |
|---|---|
| access_key | AWS access key ID |
| secret_key | AWS secret access key |
| session_token | AWS session token |
| region_name | AWS region (e.g. "us-east-1") |
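Both AWS-backed classes take the same four constructor arguments, so the credentials can be read once from the standard AWS environment variables. The sketch below is illustrative: `aws_connector_kwargs` is a hypothetical helper, and it treats the session token as mandatory only because this page lists it under required credentials.

```python
import os
from typing import Mapping, Optional


def aws_connector_kwargs(
    env: Optional[Mapping[str, str]] = None,
    default_region: str = "us-east-1",
) -> dict:
    """Build constructor kwargs from the standard AWS environment variables."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "session_token": env["AWS_SESSION_TOKEN"],
        # Fall back to the region used in the examples when none is configured.
        "region_name": env.get("AWS_DEFAULT_REGION") or default_region,
    }
```

The same dictionary can be unpacked into either class, for example `AmazonTextractor(**aws_connector_kwargs())`.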
MistralOCR
Extracts text from PDFs using the Mistral OCR API. Goes beyond basic extraction — supports structured annotations via Pydantic models and document question-answering. Best for complex document layouts with mixed text, tables, and images.
Supported formats: PDF
Required credentials: Mistral API key
Basic extraction
```python
from elsai_ocr_extractors.mistral_ocr import MistralOCR

ocr = MistralOCR(
    file_path="path/to/your/document.pdf",
    api_key="your-mistral-api-key",
)
result = ocr.extract()
print(result)
```

Structured annotation with Pydantic models
Define custom Pydantic models to extract structured metadata about images and the document itself:
```python
from pydantic import BaseModel, Field

from elsai_ocr_extractors.mistral_ocr import MistralOCR


class MyBBoxFormat(BaseModel):
    image_type: str = Field(..., description="Type of the image")
    summary: str = Field(..., description="Summary of image content")


class MyDocFormat(BaseModel):
    language: str = Field(..., description="Language of the document")
    chapter_titles: list[str] = Field(..., description="Chapter titles in the document")
    urls: list[str] = Field(..., description="List of URLs found in the document")


ocr = MistralOCR(
    file_path="path/to/your/document.pdf",
    api_key="your-mistral-api-key",
)
annotated_result = ocr.extract(
    bbox_annotation_model=MyBBoxFormat,  # applied to each image bounding box
    document_annotation_model=MyDocFormat,  # applied to the whole document
)
print(annotated_result)
```

Document question-answering
Ask a natural-language question directly about the document content:
```python
answer = ocr.ask_question("What is the last sentence in the document?")
print(answer)
```

Parameters:
| Parameter | Description |
|---|---|
| file_path | Path to the PDF file |
| api_key | Mistral API key |
extract() options:
| Parameter | Description |
|---|---|
| bbox_annotation_model | Pydantic model applied to each detected image region |
| document_annotation_model | Pydantic model applied to the full document for metadata extraction |
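The question-answering call shown earlier takes one question at a time. To run a fixed checklist against a document, the calls can be collected into a dictionary. This is a minimal sketch: `ask_questions` is a hypothetical helper that assumes only the `ask_question(question)` method documented above and stores whatever it returns.

```python
from typing import Any, Dict, Iterable


def ask_questions(ocr: Any, questions: Iterable[str]) -> Dict[str, Any]:
    """Ask each question in turn and map it to the answer that came back."""
    answers: Dict[str, Any] = {}
    for question in questions:
        # One API call per question; repeated questions are asked only once.
        if question not in answers:
            answers[question] = ocr.ask_question(question)
    return answers
```

For instance, `ask_questions(ocr, ["Who issued the invoice?", "What is the total?"])` returns both answers keyed by their question text.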
Version history
| Version | Changes |
|---|---|
| 2.0.1 | Current stable release |
| 2.0.0 | Added MistralOCR, AmazonBoto3Connector, AmazonTextractor |
| 1.0.0 | VisionAIExtractor sync/async + batch sizes; AzureDocumentIntelligence keyword args |