Elsai OCR Extractors v1.0.0#

The Elsai OCR Extractors package provides seamless integration with OCR services for extracting text from PDF documents, with enhanced capabilities in v1.0.0:

  • Azure Cognitive Services

  • Azure Document Intelligence

  • Llama Parse

  • VisionAI

  • AmazonBoto3Connector (Uses boto3)

  • AmazonTextractor (Uses amazon-textract-textractor)

  • Mistral OCR

Prerequisites#

  • Python >= 3.9

  • .env file with appropriate API keys and configuration variables
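Each component reads its configuration from environment variables, listed per component below. A consolidated .env covering all of them might look like this (all values are placeholders):

```shell
# Azure Cognitive Services
AZURE_SUBSCRIPTION_KEY=your-subscription-key-here
AZURE_ENDPOINT=your-endpoint-here

# Azure Document Intelligence
VISION_ENDPOINT=your-vision-endpoint-here
VISION_KEY=your-vision-key-here

# VisionAI (OpenAI)
OPENAI_API_KEY=your-openai-api-key-here

# AWS (AmazonBoto3Connector and AmazonTextractor)
AWS_ACCESS_KEY_ID=your-access-key-id-here
AWS_SECRET_ACCESS_KEY=your-secret-access-key-here
AWS_SESSION_TOKEN=your-session-token-here
AWS_REGION=us-east-1

# Mistral OCR
MISTRAL_API_KEY=your-mistral-api-key-here
```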

Installation#

To install the elsai-ocr-extractors package:

pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-ocr-extractors/ elsai-ocr-extractors==1.0.0

Components#

1. AzureCognitiveService#

AzureCognitiveService is a class for extracting text from PDFs using Azure’s Cognitive Services. It supports both direct initialization and environment-variable-based configuration.

from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

azure_ocr = AzureCognitiveService(
    file_path="your-file-path-here",
    subscription_key="your-subscription-key-here",
    endpoint="your-endpoint-here"
)  # Or set these as environment variables

extracted_text = azure_ocr.extract_text_from_pdf()

Required Environment Variables:

  • AZURE_SUBSCRIPTION_KEY – Your Azure OCR subscription key

  • AZURE_ENDPOINT – Your Azure OCR endpoint

2. AzureDocumentIntelligence#

AzureDocumentIntelligence is a class that utilizes Azure’s advanced Document Intelligence to extract structured text from documents like PDFs. It now accepts additional keyword arguments for enhanced flexibility.

from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

azure_ocr = AzureDocumentIntelligence(
    file_path="your-file-path",
    vision_endpoint="your-vision-endpoint",
    vision_key="your-vision-key"
)  # Or set these as environment variables

extracted_text = azure_ocr.extract_text()      # Optionally pass pages="1,3" to limit pages; additional **kwargs are forwarded
extracted_tables = azure_ocr.extract_tables()  # Optionally pass pages="1,3" to limit pages; additional **kwargs are forwarded
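The optional pages argument is a comma-separated string of page numbers such as "1,3". As a small illustration of that format (not the package's own parser):

```python
def parse_pages(pages: str) -> list:
    """Parse a comma-separated page spec like "1,3" into a sorted list of ints."""
    return sorted(int(p) for p in pages.split(",") if p.strip())

print(parse_pages("1,3"))  # [1, 3]
```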

Required Environment Variables:

  • VISION_ENDPOINT – Endpoint for Azure Document Intelligence

  • VISION_KEY – Subscription key for Azure Document Intelligence

3. Llama Parse Extractor#

LlamaParseExtractor is a wrapper class for interacting with the LlamaParse API to parse and load CSV data.

from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor

llama_parse_extractor = LlamaParseExtractor(api_key="llama_parse_api_key")
loaded_data = llama_parse_extractor.load_csv("path/to/your/file.csv")

4. VisionAI Extractor (Enhanced)#

VisionAIExtractor has been enhanced with new capabilities for v1.0.0:

New features in v1.0.0:

  • Support for synchronous and asynchronous extraction modes

  • Configurable batch size for large documents

  • Dynamic prompt updating before extraction

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

# Initialize the extractor
vision_ai = VisionAIExtractor(
    api_key="your_openai_api_key",
    model_name="gpt-4o-mini"
)

# Synchronous extraction
response = vision_ai.extract_text_from_pdf(pdf_path="/path/to/your/document.pdf")
print(response[0].page_content)  # Print the content of the first page

# Asynchronous extraction with configurable batch size
async def extract_text():
    response = await vision_ai.extract_text_from_pdf_async(
        pdf_path="/path/to/your/document.pdf",
        batch_size=2  # Default is None, which processes all pages concurrently
    )
    print(response[0].page_content)

import asyncio
asyncio.run(extract_text())  # Run the async function to extract text
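With batch_size set, pages are grouped into fixed-size chunks rather than all being dispatched at once. A minimal sketch of that chunking logic (illustrative only, not the package's internal code):

```python
from typing import Iterator, List, Optional

def chunk_pages(pages: List[int], batch_size: Optional[int]) -> Iterator[List[int]]:
    """Yield pages in groups of batch_size; None means one batch with all pages."""
    if batch_size is None:
        yield pages
        return
    for start in range(0, len(pages), batch_size):
        yield pages[start:start + batch_size]

print(list(chunk_pages([1, 2, 3, 4, 5], batch_size=2)))
# [[1, 2], [3, 4], [5]]
```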

Required Environment Variable:

  • OPENAI_API_KEY – Your OpenAI API key for Vision AI access

5. AmazonBoto3Connector#

AmazonBoto3Connector uses boto3 for AWS services with support for both synchronous (single page) and asynchronous (multi-page) processing.

from elsai_ocr_extractors.amazon_boto3 import AmazonBoto3Connector

# Initialize the connector
client = AmazonBoto3Connector(
    access_key="your_aws_access_key",
    secret_key="your_aws_secret_key",
    session_token="your_session_token",
    region_name="us-east-1"
)

# S3 file processing - Synchronous
resp = client.sync_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# S3 file processing - Asynchronous
resp = client.async_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# Local file processing - Synchronous
resp = client.sync_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

# Local file processing - Asynchronous
resp = client.async_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)
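As the examples show, file_source accepts either an s3:// URI or a local path (local files presumably being staged via the supplied s3_bucket and s3_folder). A rough sketch of how such a source string can be classified (a hypothetical helper, not part of the package):

```python
from urllib.parse import urlparse

def parse_file_source(file_source: str):
    """Return ('s3', bucket, key) for s3:// URIs, else ('local', path, None)."""
    parsed = urlparse(file_source)
    if parsed.scheme == "s3":
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    return ("local", file_source, None)

print(parse_file_source("s3://voice-agent-upload/general/FRF.pdf"))
# ('s3', 'voice-agent-upload', 'general/FRF.pdf')
```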

Required Environment Variables:

  • AWS_ACCESS_KEY_ID – Your AWS access key ID

  • AWS_SECRET_ACCESS_KEY – Your AWS secret access key

  • AWS_SESSION_TOKEN – Your AWS session token (for temporary credentials)

  • AWS_REGION – AWS region (e.g., us-east-1)

6. AmazonTextractor#

AmazonTextractor uses amazon-textract-textractor for AWS Textract integration with support for both synchronous and asynchronous processing.

from elsai_ocr_extractors.amazon_textractor import AmazonTextractor

# Initialize the extractor
client = AmazonTextractor(
    access_key="your_aws_access_key",
    secret_key="your_aws_secret_key",
    session_token="your_session_token",
    region_name="us-east-1"
)

# S3 file processing - Synchronous
resp = client.sync_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# S3 file processing - Asynchronous
resp = client.async_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# Local file processing - Synchronous
resp = client.sync_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

# Local file processing - Asynchronous
resp = client.async_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

Required Environment Variables:

  • AWS_ACCESS_KEY_ID – Your AWS access key ID

  • AWS_SECRET_ACCESS_KEY – Your AWS secret access key

  • AWS_SESSION_TOKEN – Your AWS session token (for temporary credentials)

  • AWS_REGION – AWS region (e.g., us-east-1)

7. Mistral OCR Extractor#

MistralOCR is a new addition to the package, offering comprehensive OCR capabilities including basic text extraction, structured annotations, and document question answering.

import os
from pydantic import BaseModel, Field
from elsai_ocr_extractors.mistral_ocr import MistralOCR

# Optional: Define Pydantic models for annotations
class MyBBoxFormat(BaseModel):
    image_type: str = Field(..., description="Type of the image")
    summary: str = Field(..., description="Summary of image content")

class MyDocFormat(BaseModel):
    language: str = Field(..., description="Language of the document")
    chapter_titles: list[str] = Field(..., description="Chapter titles")
    urls: list[str] = Field(..., description="List of URLs in the document")

def test_mistral_ocr():
    print("🔧 Initializing MistralOCR...")
    ocr = MistralOCR(
        file_path="path/to/your/document.pdf",
        api_key="your_mistral_api_key"
    )

    # Basic OCR
    print("\n▶️ Running basic OCR...")
    ocr_response = ocr.extract()
    print("✅ OCR Response received.")
    print(str(ocr_response)[:500], "...\n")

    # OCR with annotations
    print("▶️ Running OCR with annotations...")
    annotated_response = ocr.extract(
        bbox_annotation_model=MyBBoxFormat,
        document_annotation_model=MyDocFormat
    )
    print("✅ Annotated OCR Response received.")
    print(str(annotated_response)[:500], "...\n")

    # Document QnA
    print("▶️ Asking question from document...")
    answer = ocr.ask_question("What is the last sentence in the document?")
    print("✅ QnA Response:")
    print(answer)

if __name__ == "__main__":
    try:
        test_mistral_ocr()
    except Exception as e:
        print("❌ Error during test:", e)

Required Environment Variable:

  • MISTRAL_API_KEY – Your Mistral AI API key