elsai OCR Extractors

Package: elsai-ocr-extractors v2.0.1

A unified collection of OCR integrations for extracting text, tables, and structured data from scanned documents and images. Each extractor wraps a different OCR backend — pick the one that fits your document type, cloud provider, and accuracy requirements.

Installation

bash

pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-ocr-extractors/ elsai-ocr-extractors==2.0.1

Requirements: Python >= 3.9
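
Before importing any extractor, it can help to confirm that the interpreter meets the stated minimum and that the package is actually installed. The sketch below uses only the standard library. The helper names `python_is_supported` and `installed_version` are illustrative and are not part of the package.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 9)  # taken from the requirement stated above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = "elsai-ocr-extractors") -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("elsai-ocr-extractors:", installed_version() or "not installed")
```

If `installed_version()` returns `None`, the install command above has probably not been run in the active environment.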

Available extractors

Extractor	Class	Best for
Azure Cognitive Services	`AzureCognitiveService`	General-purpose PDF text extraction
Azure Document Intelligence	`AzureDocumentIntelligence`	Structured documents — tables and key-value fields
LlamaParse	`LlamaParseExtractor`	CSV data parsing via LlamaParse API
VisionAI	`VisionAIExtractor`	PDFs and images via GPT-4o vision (sync + async)
Amazon Boto3	`AmazonBoto3Connector`	AWS Textract via boto3 — S3 and local files
Amazon Textractor	`AmazonTextractor`	AWS Textract via textractor library
Mistral OCR	`MistralOCR`	Complex layouts, annotations, document Q&A

Supported file formats

Extractor	PDF	PNG	JPG/JPEG	TIFF	BMP	GIF	WEBP	CSV
`AzureCognitiveService`	✅	—	—	—	—	—	—	—
`AzureDocumentIntelligence`	✅	✅	✅	✅	—	—	—	—
`LlamaParseExtractor`	—	—	—	—	—	—	—	✅
`VisionAIExtractor`	✅	✅	✅	✅	✅	✅	✅	—
`AmazonBoto3Connector`	✅	✅	✅	✅	—	—	—	—
`AmazonTextractor`	✅	✅	✅	✅	—	—	—	—
`MistralOCR`	✅	—	—	—	—	—	—	—
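
The format table can also be expressed as a lookup, which is handy when a pipeline has to route mixed uploads to a suitable extractor. The sketch below transcribes the table by file extension. The names `SUPPORTED_FORMATS` and `candidate_extractors` are hypothetical helpers, not part of the package, and only the extensions listed in the table are included.

```python
from pathlib import Path

# Transcribed from the "Supported file formats" table; keys are class names.
SUPPORTED_FORMATS = {
    "AzureCognitiveService": {".pdf"},
    "AzureDocumentIntelligence": {".pdf", ".png", ".jpg", ".jpeg", ".tiff"},
    "LlamaParseExtractor": {".csv"},
    "VisionAIExtractor": {
        ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp",
    },
    "AmazonBoto3Connector": {".pdf", ".png", ".jpg", ".jpeg", ".tiff"},
    "AmazonTextractor": {".pdf", ".png", ".jpg", ".jpeg", ".tiff"},
    "MistralOCR": {".pdf"},
}


def candidate_extractors(file_path: str) -> list:
    """Return the extractor class names whose table row covers this extension."""
    extension = Path(file_path).suffix.lower()
    return [name for name, exts in SUPPORTED_FORMATS.items() if extension in exts]


if __name__ == "__main__":
    for sample in ("scan.pdf", "photo.webp", "data.csv", "notes.docx"):
        print(sample, "->", candidate_extractors(sample))
```

The lookup only narrows the choice by format. Accuracy, cloud provider, and the need for tables or key-value fields still decide between the remaining candidates.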

AzureCognitiveService

Extracts plain text from PDFs using Azure Cognitive Services (Computer Vision Read API). Best for scanned documents and printed text where you only need the raw text content.

Required credentials: Azure subscription key (`AZURE_SUBSCRIPTION_KEY`) and Azure endpoint URL

python

from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

extractor = AzureCognitiveService(
    file_path="path/to/your/document.pdf",
    subscription_key="your-subscription-key",
    endpoint="https://your-resource.cognitiveservices.azure.com/",
)

extracted_text = extractor.extract_text_from_pdf()
print(extracted_text)

Parameters:

Parameter	Description
`file_path`	Path to the PDF file to extract text from
`subscription_key`	Azure Cognitive Services subscription key
`endpoint`	Azure Cognitive Services endpoint URL
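
Hardcoding keys in source files is easy to get wrong, so one option is to read them from the environment and pass them to the constructor as keyword arguments. The sketch below assumes the key lives in `AZURE_SUBSCRIPTION_KEY` (the name used in the credentials line above) and the endpoint in `AZURE_ENDPOINT`. The second variable name and the helper `load_azure_credentials` are assumptions made for illustration.

```python
import os


def load_azure_credentials(env=os.environ) -> dict:
    """Read the subscription key and endpoint from environment variables.

    AZURE_ENDPOINT is a hypothetical variable name chosen for this sketch.
    """
    required = ("AZURE_SUBSCRIPTION_KEY", "AZURE_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "subscription_key": env["AZURE_SUBSCRIPTION_KEY"],
        "endpoint": env["AZURE_ENDPOINT"],
    }


# Usage (requires the package and real credentials):
# extractor = AzureCognitiveService(
#     file_path="path/to/your/document.pdf",
#     **load_azure_credentials(),
# )
```

The returned keys match the constructor parameter names in the table above, so the dictionary can be unpacked directly.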

AzureDocumentIntelligence

Extracts both text and structured data (tables, key-value pairs) from documents using Azure Document Intelligence (formerly Form Recognizer). Use this when you need more than raw text — invoices, forms, receipts, or any document with a defined structure.

Supported formats: PDF, PNG, JPG, JPEG, TIFF

Required credentials: Azure Document Intelligence endpoint and key

python

from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

extractor = AzureDocumentIntelligence(
    file_path="path/to/your/document.pdf",
    vision_endpoint="https://your-resource.cognitiveservices.azure.com/",
    vision_key="your-vision-key",
)

# Extract plain text
extracted_text = extractor.extract_text()
print(extracted_text)

# Extract tables as structured data
extracted_tables = extractor.extract_tables()
print(extracted_tables)

Parameters:

Parameter	Description
`file_path`	Path to the document file
`vision_endpoint`	Azure Document Intelligence endpoint URL
`vision_key`	Azure Document Intelligence API key

TIP

Use `extract_tables()` for documents like invoices, spreadsheets, or financial reports where tabular structure matters. Use `extract_text()` for contracts, reports, or any free-form text document.

LlamaParseExtractor

Parses CSV files using the LlamaParse API. Unlike the other extractors, this is not an image/PDF OCR tool — it is designed specifically for loading and parsing structured CSV data into a format your agents can reason over.

Supported formats: CSV

Required credentials: LlamaParse API key (from LlamaCloud)

python

from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor

extractor = LlamaParseExtractor(api_key="your-llama-cloud-api-key")

loaded_data = extractor.load_csv("path/to/your/file.csv")
print(loaded_data)

Parameters:

Parameter	Description
`api_key`	LlamaParse API key

VisionAIExtractor

Extracts text from PDFs and images by sending each page through a GPT-4o-class vision model. Supports both synchronous extraction and asynchronous batch processing for large documents.

Supported formats: PDF, PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP

Required credentials: OpenAI API key

python

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

extractor = VisionAIExtractor(
    api_key="your-openai-api-key",
    model_name="gpt-4o-mini",   # or "gpt-4o" for higher accuracy
)

# Extract from a PDF
response = extractor.extract_text_from_file(file_path="/path/to/document.pdf")
print(response[0].page_content)

# Extract from an image
response = extractor.extract_text_from_file(file_path="/path/to/image.png")
print(response[0].page_content)

Async extraction with configurable batch size:

For large documents, use the async method to process pages in parallel batches:

python

import asyncio

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

async def extract_large_document():
    extractor = VisionAIExtractor(
        api_key="your-openai-api-key",
        model_name="gpt-4o-mini",
    )
    response = await extractor.extract_text_from_file_async(
        file_path="/path/to/large-document.pdf",
        batch_size=10,   # process 10 pages at a time
    )
    for page in response:
        print(page.page_content)

asyncio.run(extract_large_document())

Parameters:

Parameter	Description
`api_key`	OpenAI API key
`model_name`	Vision model to use (e.g. `"gpt-4o-mini"`, `"gpt-4o"`)

Return value: A list of document objects, each with a `page_content` attribute containing the extracted text for that page.
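
Because the result is one object per page, downstream code often needs the pages merged into a single string. The sketch below does that using only the `page_content` attribute described above. The helper `join_pages` is hypothetical, and `SimpleNamespace` objects stand in for the real page objects.

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n\n"):
    """Concatenate the text of each page, skipping pages with no content."""
    texts = []
    for page in pages:
        content = (page.page_content or "").strip()
        if content:
            texts.append(content)
    return separator.join(texts)


if __name__ == "__main__":
    # Stand-ins for the objects returned by the extractor.
    demo_pages = [
        SimpleNamespace(page_content="First page text\n"),
        SimpleNamespace(page_content=""),
        SimpleNamespace(page_content="  Second page text"),
    ]
    print(join_pages(demo_pages))
```

The same helper works on the result of either the synchronous or the asynchronous method, since both are shown returning objects with `page_content`.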

AmazonBoto3Connector

Processes documents using AWS Textract via the boto3 SDK. Supports both synchronous (single-page, inline) and asynchronous (multi-page, S3-backed) processing. The asynchronous path is required for multi-page PDFs on Textract.

Supported formats: PDF, PNG, JPG, JPEG, TIFF

Required credentials: AWS access key, secret key, session token, region

python

from elsai_ocr_extractors.amazon_boto3 import AmazonBoto3Connector

client = AmazonBoto3Connector(
    access_key="your-aws-access-key",
    secret_key="your-aws-secret-key",
    session_token="your-session-token",
    region_name="us-east-1",
)

# Synchronous: for single-page images or S3-hosted files
resp = client.sync_process_document(
    file_source="s3://your-bucket/folder/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
)
print(resp)

# Asynchronous — for multi-page documents (uploads to S3 first)
resp = client.async_process_document(
    file_source="/path/to/local/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="your-bucket",
    s3_folder="uploads",
)
print(resp)

Parameters:

Parameter	Description
`access_key`	AWS access key ID
`secret_key`	AWS secret access key
`session_token`	AWS session token (for temporary credentials)
`region_name`	AWS region where Textract is called (e.g. `"us-east-1"`)

feature_list options:

Feature	What it extracts
`"text"`	Raw text lines and words
`"tables"`	Tabular data with rows and cells
`"forms"`	Key-value pairs from form fields
`"layout"`	Document layout and reading order
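
Since `feature_list` is a list of plain strings, a typo is easy to miss until the request is sent. The sketch below checks a list against the four options in the table before any call is made. The helper `normalize_feature_list` is hypothetical, and the connector may well perform its own validation.

```python
# The four options from the feature_list table.
ALLOWED_FEATURES = ("text", "tables", "forms", "layout")


def normalize_feature_list(features):
    """Lowercase, trim, and de-duplicate features; reject unknown names."""
    normalized = []
    for feature in features:
        key = feature.strip().lower()
        if key not in ALLOWED_FEATURES:
            raise ValueError(
                f"Unknown feature {feature!r}; expected one of {ALLOWED_FEATURES}"
            )
        if key not in normalized:
            normalized.append(key)
    return normalized


if __name__ == "__main__":
    print(normalize_feature_list(["Text", "tables", "text "]))
```

Requesting only the features a document needs keeps responses smaller, so a list such as `["text"]` is a reasonable default for plain prose.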

Sync vs async

Use `sync_process_document()` for single-page images or S3-hosted files. Use `async_process_document()` for local multi-page PDFs. The connector uploads the file to S3 automatically and polls for the result.
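
The rule above can be turned into a small routing helper. The sketch below picks a mode from the `file_source` string alone and splits an `s3://` URI into bucket and key. Both function names are hypothetical. Because the page count is not known without opening the file, every local PDF is treated as potentially multi-page, which is an assumption of this sketch.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid s3://bucket/key URI: {uri!r}")
    return parsed.netloc, key


def choose_processing_mode(file_source):
    """Return 'sync' or 'async' following the sync-vs-async guidance."""
    if file_source.startswith("s3://"):
        return "sync"   # S3-hosted files
    if file_source.lower().endswith(".pdf"):
        return "async"  # local PDF, assumed to be multi-page
    return "sync"       # local single-page image


if __name__ == "__main__":
    for source in (
        "s3://your-bucket/folder/document.pdf",
        "/path/to/local/document.pdf",
        "/path/to/scan.png",
    ):
        print(source, "->", choose_processing_mode(source))
```

A caller would then dispatch to `sync_process_document()` or `async_process_document()` based on the returned string, passing `s3_bucket` and `s3_folder` only in the asynchronous case.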

AmazonTextractor

Processes documents using AWS Textract via the amazon-textract-textractor library, which provides richer structured output than the raw boto3 response. Use this when you want higher-level document objects (blocks, tables, queries) rather than raw API responses.

Supported formats: PDF, PNG, JPG, JPEG, TIFF

Required credentials: AWS access key, secret key, session token, region

python

from elsai_ocr_extractors.amazon_textractor import AmazonTextractor

client = AmazonTextractor(
    access_key="your-aws-access-key",
    secret_key="your-aws-secret-key",
    session_token="your-session-token",
    region_name="us-east-1",
)

# Synchronous — S3-hosted documents
resp = client.sync_process_document(
    file_source="s3://your-bucket/folder/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
)
print(resp)

Parameters:

Parameter	Description
`access_key`	AWS access key ID
`secret_key`	AWS secret access key
`session_token`	AWS session token (for temporary credentials)
`region_name`	AWS region (e.g. `"us-east-1"`)

MistralOCR

Extracts text from PDFs using the Mistral OCR API. Beyond basic extraction, it supports structured annotations via Pydantic models and document question-answering. Best for complex document layouts with mixed text, tables, and images.

Supported formats: PDF

Required credentials: Mistral API key

Basic extraction

python

from elsai_ocr_extractors.mistral_ocr import MistralOCR

ocr = MistralOCR(
    file_path="path/to/your/document.pdf",
    api_key="your-mistral-api-key",
)

result = ocr.extract()
print(result)

Structured annotation with Pydantic models

Define custom Pydantic models to extract structured metadata about images and the document itself:

python

from pydantic import BaseModel, Field
from elsai_ocr_extractors.mistral_ocr import MistralOCR

class MyBBoxFormat(BaseModel):
    image_type: str = Field(..., description="Type of the image")
    summary: str = Field(..., description="Summary of image content")

class MyDocFormat(BaseModel):
    language: str = Field(..., description="Language of the document")
    chapter_titles: list[str] = Field(..., description="Chapter titles in the document")
    urls: list[str] = Field(..., description="List of URLs found in the document")

ocr = MistralOCR(
    file_path="path/to/your/document.pdf",
    api_key="your-mistral-api-key",
)

annotated_result = ocr.extract(
    bbox_annotation_model=MyBBoxFormat,     # applied to each image bounding box
    document_annotation_model=MyDocFormat,  # applied to the whole document
)
print(annotated_result)

Document question-answering

Ask a natural-language question directly about the document content:

python

# Reuses the `ocr` instance created in the examples above
answer = ocr.ask_question("What is the last sentence in the document?")
print(answer)

Parameters:

Parameter	Description
`file_path`	Path to the PDF file
`api_key`	Mistral API key

extract() options:

Parameter	Description
`bbox_annotation_model`	Pydantic model applied to each detected image region
`document_annotation_model`	Pydantic model applied to the full document for metadata extraction

Version history

Version	Changes
2.0.1	Current stable release
2.0.0	Added `MistralOCR`, `AmazonBoto3Connector`, `AmazonTextractor`
1.0.0	`VisionAIExtractor` sync and async extraction with configurable batch size; `AzureDocumentIntelligence` keyword arguments
