Elsai OCR Extractors

Package: elsai-ocr-extractors  v2.0.1

A unified collection of OCR integrations for extracting text, tables, and structured data from scanned documents and images. Each extractor wraps a different OCR backend — pick the one that fits your document type, cloud provider, and accuracy requirements.

Installation

bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-ocr-extractors/ elsai-ocr-extractors==2.0.1

Requirements: Python >= 3.9
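A quick way to confirm that an environment matches the requirement and the pinned version is a small check script. The sketch below uses only the standard library; the distribution name and version come from the install command above, and `check_environment` is a hypothetical helper name, not part of the package.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)                # from the stated requirement
PACKAGE = "elsai-ocr-extractors"   # distribution name from the install command
EXPECTED_VERSION = "2.0.1"         # version pinned in the install command


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found "
            f"{sys.version_info.major}.{sys.version_info.minor}"
        )
    try:
        installed = metadata.version(PACKAGE)
    except metadata.PackageNotFoundError:
        problems.append(f"{PACKAGE} is not installed")
    else:
        if installed != EXPECTED_VERSION:
            problems.append(
                f"{PACKAGE} {installed} is installed, expected {EXPECTED_VERSION}"
            )
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```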


Available extractors

| Extractor | Class | Best for |
| --- | --- | --- |
| Azure Cognitive Services | AzureCognitiveService | General-purpose PDF text extraction |
| Azure Document Intelligence | AzureDocumentIntelligence | Structured documents: tables and key-value fields |
| LlamaParse | LlamaParseExtractor | CSV data parsing via LlamaParse API |
| VisionAI | VisionAIExtractor | PDFs and images via GPT-4o vision (sync + async) |
| Amazon Boto3 | AmazonBoto3Connector | AWS Textract via boto3 (S3 and local files) |
| Amazon Textractor | AmazonTextractor | AWS Textract via textractor library |
| Mistral OCR | MistralOCR | Complex layouts, annotations, document Q&A |

Supported file formats

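Cells follow the formats listed in each extractor's section on this page; a dash means the format is not listed there.

| Extractor | PDF | PNG | JPG/JPEG | TIFF | BMP | GIF | WEBP | CSV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AzureCognitiveService | Yes | - | - | - | - | - | - | - |
| AzureDocumentIntelligence | Yes | Yes | Yes | Yes | - | - | - | - |
| LlamaParseExtractor | - | - | - | - | - | - | - | Yes |
| VisionAIExtractor | Yes | Yes | Yes | Yes | Yes | Yes | Yes | - |
| AmazonBoto3Connector | Yes | Yes | Yes | Yes | - | - | - | - |
| AmazonTextractor | Yes | Yes | Yes | Yes | - | - | - | - |
| MistralOCR | Yes | - | - | - | - | - | - | - |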
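The per-extractor format lists on this page can be turned into a small lookup for picking candidate extractors by file extension. This is a sketch only: `SUPPORTED_FORMATS` and `extractors_for` are hypothetical names, the mapping is copied from the "Supported formats" lines in each section (PDF only for AzureCognitiveService, based on its description), and nothing here is provided by the package itself.

```python
from pathlib import Path

# Formats as listed in each extractor's section of this page.
SUPPORTED_FORMATS = {
    "AzureCognitiveService": {"pdf"},
    "AzureDocumentIntelligence": {"pdf", "png", "jpg", "jpeg", "tiff"},
    "LlamaParseExtractor": {"csv"},
    "VisionAIExtractor": {"pdf", "png", "jpg", "jpeg", "gif", "bmp", "tiff", "webp"},
    "AmazonBoto3Connector": {"pdf", "png", "jpg", "jpeg", "tiff"},
    "AmazonTextractor": {"pdf", "png", "jpg", "jpeg", "tiff"},
    "MistralOCR": {"pdf"},
}


def extractors_for(file_path):
    """Return the class names whose documented formats include this file's extension."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    return [name for name, formats in SUPPORTED_FORMATS.items() if extension in formats]


print(extractors_for("data.csv"))
print(extractors_for("scan.webp"))
```

The lookup only narrows the choice by format; accuracy, cost, and cloud provider still decide between the remaining candidates.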

AzureCognitiveService

Extracts plain text from PDFs using Azure Cognitive Services (Computer Vision Read API). Best for scanned documents and printed text where you only need the raw text content.

Required credentials: Azure Cognitive Services subscription key (AZURE_SUBSCRIPTION_KEY) and endpoint URL

python
from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

extractor = AzureCognitiveService(
    file_path="path/to/your/document.pdf",
    subscription_key="your-subscription-key",
    endpoint="https://your-resource.cognitiveservices.azure.com/",
)

extracted_text = extractor.extract_text_from_pdf()
print(extracted_text)

Parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the PDF file to extract text from |
| subscription_key | Azure Cognitive Services subscription key |
| endpoint | Azure Cognitive Services endpoint URL |
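Rather than hardcoding credentials as in the example above, they can be read from the environment. The sketch below assumes the key lives in `AZURE_SUBSCRIPTION_KEY` (the name mentioned in the credentials line) and the endpoint in `AZURE_ENDPOINT`, which is a hypothetical variable name chosen for illustration. `require_env` and `azure_cognitive_settings` are likewise illustrative helpers, not part of the package.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error if unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def azure_cognitive_settings():
    """Collect the two constructor arguments documented for AzureCognitiveService."""
    return {
        # Name taken from the credentials line on this page.
        "subscription_key": require_env("AZURE_SUBSCRIPTION_KEY"),
        # Hypothetical variable name; use whatever your deployment defines.
        "endpoint": require_env("AZURE_ENDPOINT"),
    }
```

The returned dictionary uses the documented keyword names, so it could be passed as `AzureCognitiveService(file_path="path/to/your/document.pdf", **azure_cognitive_settings())`.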

AzureDocumentIntelligence

Extracts both text and structured data (tables, key-value pairs) from documents using Azure Document Intelligence (formerly Form Recognizer). Use this when you need more than raw text — invoices, forms, receipts, or any document with a defined structure.

Supported formats: PDF, PNG, JPG, JPEG, TIFF

Required credentials: Azure Document Intelligence endpoint and key

python
from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

extractor = AzureDocumentIntelligence(
    file_path="path/to/your/document.pdf",
    vision_endpoint="https://your-resource.cognitiveservices.azure.com/",
    vision_key="your-vision-key",
)

# Extract plain text
extracted_text = extractor.extract_text()
print(extracted_text)

# Extract tables as structured data
extracted_tables = extractor.extract_tables()
print(extracted_tables)

Parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the document file |
| vision_endpoint | Azure Document Intelligence endpoint URL |
| vision_key | Azure Document Intelligence API key |

TIP

Use extract_tables() for documents like invoices, spreadsheets, or financial reports where tabular structure matters. Use extract_text() for contracts, reports, or any free-form text document.


LlamaParseExtractor

Parses CSV files using the LlamaParse API. Unlike the other extractors, this is not an image/PDF OCR tool — it is designed specifically for loading and parsing structured CSV data into a format your agents can reason over.

Supported formats: CSV

Required credentials: LlamaParse API key (from LlamaCloud)

python
from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor

extractor = LlamaParseExtractor(api_key="your-llama-cloud-api-key")

loaded_data = extractor.load_csv("path/to/your/file.csv")
print(loaded_data)

Parameters:

| Parameter | Description |
| --- | --- |
| api_key | LlamaParse API key |

VisionAIExtractor

Extracts text from PDFs and images by sending each page through a GPT-4o-class vision model. Supports both synchronous extraction and asynchronous batch processing for large documents.

Supported formats: PDF, PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP

Required credentials: OpenAI API key

python
from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

extractor = VisionAIExtractor(
    api_key="your-openai-api-key",
    model_name="gpt-4o-mini",   # or "gpt-4o" for higher accuracy
)

# Extract from a PDF
response = extractor.extract_text_from_file(file_path="/path/to/document.pdf")
print(response[0].page_content)

# Extract from an image
response = extractor.extract_text_from_file(file_path="/path/to/image.png")
print(response[0].page_content)

Async extraction with configurable batch size:

For large documents, use the async method to process pages in parallel batches:

python
import asyncio

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

async def extract_large_document():
    extractor = VisionAIExtractor(
        api_key="your-openai-api-key",
        model_name="gpt-4o-mini",
    )
    response = await extractor.extract_text_from_file_async(
        file_path="/path/to/large-document.pdf",
        batch_size=10,   # process 10 pages at a time
    )
    for page in response:
        print(page.page_content)

asyncio.run(extract_large_document())
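How the extractor batches pages internally is not documented here, but the general pattern behind a `batch_size` argument can be sketched with plain asyncio: pages within a batch run concurrently, and batches run one after another. In the sketch below, `process_in_batches` and `fake_ocr` are hypothetical stand-ins and make no network calls.

```python
import asyncio


async def process_in_batches(items, worker, batch_size=10):
    """Run worker(item) concurrently within each batch; batches run sequentially."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # asyncio.gather returns results in the same order as its inputs.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_ocr(page_number):
    # Stand-in for a per-page vision model call.
    await asyncio.sleep(0)
    return f"text of page {page_number}"


pages = list(range(1, 26))
texts = asyncio.run(process_in_batches(pages, fake_ocr, batch_size=10))
print(len(texts), texts[0], texts[-1])
```

A larger batch size means more requests in flight at once, which shortens wall-clock time but is more likely to hit API rate limits.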

Parameters:

| Parameter | Description |
| --- | --- |
| api_key | OpenAI API key |
| model_name | Vision model to use (e.g. "gpt-4o-mini", "gpt-4o") |

Return value: A list of document objects, each with a page_content attribute containing the extracted text for that page.
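Because the result is a list with one entry per page, a common follow-up step is to merge the pages into a single string. The sketch below relies only on the documented `page_content` attribute; the `Page` dataclass is a stand-in for the returned objects and `join_pages` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Stand-in for the returned document objects; only page_content is documented.
    page_content: str


def join_pages(pages, separator="\n\n"):
    """Concatenate non-empty pages, labelling each with its 1-based page number."""
    parts = []
    for number, page in enumerate(pages, start=1):
        text = (page.page_content or "").strip()
        if text:
            parts.append(f"[page {number}]\n{text}")
    return separator.join(parts)


sample = [Page("First page text. "), Page(""), Page("Third page text.")]
print(join_pages(sample))
```

Blank pages are skipped but still counted, so the labels stay aligned with the page numbers of the source document.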


AmazonBoto3Connector

Processes documents using AWS Textract via the boto3 SDK. Supports both synchronous (single-page, inline) and asynchronous (multi-page, S3-backed) processing. The asynchronous path is required for multi-page PDFs on Textract.

Supported formats: PDF, PNG, JPG, JPEG, TIFF

Required credentials: AWS access key, secret key, session token, region

python
from elsai_ocr_extractors.amazon_boto3 import AmazonBoto3Connector

client = AmazonBoto3Connector(
    access_key="your-aws-access-key",
    secret_key="your-aws-secret-key",
    session_token="your-session-token",
    region_name="us-east-1",
)

# Synchronous — for S3-hosted single-page documents
resp = client.sync_process_document(
    file_source="s3://your-bucket/folder/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
)
print(resp)

# Asynchronous — for multi-page documents (uploads to S3 first)
resp = client.async_process_document(
    file_source="/path/to/local/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="your-bucket",
    s3_folder="uploads",
)
print(resp)
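The calls above pass feature names as plain strings, so a typo would only surface once the request is made. A small pre-check against the four names documented on this page ("text", "tables", "forms", "layout") can catch that earlier. This is a sketch: `normalize_features` is a hypothetical helper, and whether the connector validates names or treats them case-sensitively is not stated here.

```python
KNOWN_FEATURES = ("text", "tables", "forms", "layout")


def normalize_features(features):
    """Lower-case, de-duplicate, and validate feature names, preserving their order."""
    cleaned = []
    for feature in features:
        key = feature.strip().lower()
        if key not in KNOWN_FEATURES:
            raise ValueError(
                f"Unknown feature {feature!r}; expected one of {KNOWN_FEATURES}"
            )
        if key not in cleaned:
            cleaned.append(key)
    if not cleaned:
        raise ValueError("feature_list must contain at least one feature")
    return cleaned


print(normalize_features(["Text", "tables", "text"]))
```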

Parameters:

| Parameter | Description |
| --- | --- |
| access_key | AWS access key ID |
| secret_key | AWS secret access key |
| session_token | AWS session token (for temporary credentials) |
| region_name | AWS region where Textract is called (e.g. "us-east-1") |

feature_list options:

| Feature | What it extracts |
| --- | --- |
| "text" | Raw text lines and words |
| "tables" | Tabular data with rows and cells |
| "forms" | Key-value pairs from form fields |
| "layout" | Document layout and reading order |

Sync vs async

Use sync_process_document() for single-page images or S3-hosted files. Use async_process_document() for local multi-page PDFs — the connector uploads the file to S3 automatically and polls for the result.
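The guidance above can be written as a small routing rule. The sketch below is one reading of it, under stated assumptions: it does not inspect page counts, so every local PDF is treated as potentially multi-page, and S3-hosted multi-page PDFs are not covered by the rule. `choose_textract_call` is a hypothetical helper that only returns the name of the method to call.

```python
import os


def choose_textract_call(file_source):
    """Return the connector method name suggested by the sync-vs-async guidance."""
    if file_source.lower().startswith("s3://"):
        # S3-hosted files go through the synchronous call.
        return "sync_process_document"
    _, extension = os.path.splitext(file_source)
    if extension.lower() == ".pdf":
        # Local PDFs may be multi-page, so use the S3-backed asynchronous path.
        return "async_process_document"
    # Local single-page images.
    return "sync_process_document"


for source in ("s3://your-bucket/folder/document.pdf", "/path/to/local/document.pdf", "scan.png"):
    print(source, "->", choose_textract_call(source))
```

The returned name could be resolved with `getattr(client, name)`; remember that the asynchronous call also needs the `s3_bucket` and `s3_folder` arguments shown earlier.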


AmazonTextractor

Processes documents using AWS Textract via the amazon-textract-textractor library, which provides richer structured output than the raw boto3 response. Use this when you want higher-level document objects (blocks, tables, queries) rather than raw API responses.

Supported formats: PDF, PNG, JPG, JPEG, TIFF

Required credentials: AWS access key, secret key, session token, region

python
from elsai_ocr_extractors.amazon_textractor import AmazonTextractor

client = AmazonTextractor(
    access_key="your-aws-access-key",
    secret_key="your-aws-secret-key",
    session_token="your-session-token",
    region_name="us-east-1",
)

# Synchronous — S3-hosted documents
resp = client.sync_process_document(
    file_source="s3://your-bucket/folder/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
)
print(resp)

Parameters:

| Parameter | Description |
| --- | --- |
| access_key | AWS access key ID |
| secret_key | AWS secret access key |
| session_token | AWS session token |
| region_name | AWS region (e.g. "us-east-1") |

MistralOCR

Extracts text from PDFs using the Mistral OCR API. Beyond basic extraction, it supports structured annotations via Pydantic models and document question answering. Best for complex document layouts with mixed text, tables, and images.

Supported formats: PDF

Required credentials: Mistral API key

Basic extraction

python
from elsai_ocr_extractors.mistral_ocr import MistralOCR

ocr = MistralOCR(
    file_path="path/to/your/document.pdf",
    api_key="your-mistral-api-key",
)

result = ocr.extract()
print(result)

Structured annotation with Pydantic models

Define custom Pydantic models to extract structured metadata about images and the document itself:

python
from pydantic import BaseModel, Field
from elsai_ocr_extractors.mistral_ocr import MistralOCR

class MyBBoxFormat(BaseModel):
    image_type: str = Field(..., description="Type of the image")
    summary: str = Field(..., description="Summary of image content")

class MyDocFormat(BaseModel):
    language: str = Field(..., description="Language of the document")
    chapter_titles: list[str] = Field(..., description="Chapter titles in the document")
    urls: list[str] = Field(..., description="List of URLs found in the document")

ocr = MistralOCR(
    file_path="path/to/your/document.pdf",
    api_key="your-mistral-api-key",
)

annotated_result = ocr.extract(
    bbox_annotation_model=MyBBoxFormat,     # applied to each image bounding box
    document_annotation_model=MyDocFormat,  # applied to the whole document
)
print(annotated_result)

Document question-answering

Ask a natural-language question directly about the document content:

python
# Reuses the `ocr` instance created in the examples above
answer = ocr.ask_question("What is the last sentence in the document?")
print(answer)

Parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the PDF file |
| api_key | Mistral API key |

extract() options:

| Parameter | Description |
| --- | --- |
| bbox_annotation_model | Pydantic model applied to each detected image region |
| document_annotation_model | Pydantic model applied to the full document for metadata extraction |

Version history

| Version | Changes |
| --- | --- |
| 2.0.1 | Current stable release |
| 2.0.0 | Added MistralOCR, AmazonBoto3Connector, AmazonTextractor |
| 1.0.0 | VisionAIExtractor sync/async + batch sizes; AzureDocumentIntelligence keyword args |

Copyright © 2026 Elsai Foundry.