Elsai OCR Extractors v1.0.0#

The Elsai OCR Extractors package integrates with the following OCR services for extracting text from PDF documents, with enhanced capabilities in v1.0.0:

  • Azure Cognitive Service

  • Azure Document Intelligence

  • Llama Parse

  • VisionAI

  • AmazonBoto3Connector (Uses boto3)

  • AmazonTextractor (Uses amazon-textract-textractor)

  • Mistral OCR

Prerequisites#

  • Python >= 3.9

  • .env file with appropriate API keys and configuration variables
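A minimal .env covering every service in this package might look like the following. All variable names come from the sections below; the values are placeholders, and you only need the entries for the services you actually use:

```
AZURE_SUBSCRIPTION_KEY=your-azure-ocr-subscription-key
AZURE_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
VISION_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
VISION_KEY=your-document-intelligence-key
OPENAI_API_KEY=your-openai-api-key
AWS_ACCESS_KEY_ID=your-aws-access-key-id
AWS_SECRET_ACCESS_KEY=your-aws-secret-access-key
AWS_SESSION_TOKEN=your-aws-session-token
AWS_REGION=us-east-1
MISTRAL_API_KEY=your-mistral-api-key
```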

Installation#

To install the elsai-ocr-extractors package:

pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-ocr-extractors/ elsai-ocr-extractors==1.0.0

Components#

1. AzureCognitiveService#

AzureCognitiveService is a class for extracting text from PDFs using Azure’s Cognitive Services. It supports both direct initialization and configuration via environment variables.

from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

azure_ocr = AzureCognitiveService(
    file_path="your-file-path-here",
    subscription_key="your-subscription-key-here",
    endpoint="your-endpoint-here"
)  # Or configure via the environment variables listed below

extracted_text = azure_ocr.extract_text_from_pdf()

Note

Enhanced Response with Metadata: In version 1.0.2, Azure Cognitive Service now returns text along with metadata, providing additional information about the extracted content such as language detection, confidence scores, and other contextual details.

Required Environment Variables:

  • AZURE_SUBSCRIPTION_KEY – Your Azure OCR subscription key

  • AZURE_ENDPOINT – Your Azure OCR endpoint

2. AzureDocumentIntelligence#

AzureDocumentIntelligence is a class that utilizes Azure’s advanced Document Intelligence to extract structured text from documents like PDFs. It now accepts additional keyword arguments for enhanced flexibility.

from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

azure_ocr = AzureDocumentIntelligence(
    file_path="your-file-path",
    vision_endpoint="your-vision-endpoint",
    vision_key="your-vision-key"
)  # Or configure via the environment variables listed below

extracted_text = azure_ocr.extract_text()      # Optionally pass pages="1,3" and **kwargs for additional parameters
extracted_tables = azure_ocr.extract_tables()  # Optionally pass pages="1,3" and **kwargs for additional parameters

Required Environment Variables:

  • VISION_ENDPOINT – Endpoint for Azure Document Intelligence

  • VISION_KEY – Subscription key for Azure Document Intelligence

3. Llama Parse Extractor#

LlamaParseExtractor is a wrapper class for interacting with the LlamaParse API to parse and load CSV data.

from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor

llama_parse_extractor = LlamaParseExtractor(api_key="llama_parse_api_key")
loaded_data = llama_parse_extractor.load_csv("path/to/your/file.csv")

4. VisionAI Extractor (Enhanced)#

VisionAIExtractor has been enhanced with new capabilities for v1.0.0:

New features in v1.0.0:

  • Support for synchronous and asynchronous extraction modes

  • Configurable batch size for large documents

  • Dynamic prompt updating before extraction

Note

Streaming Functionality: In version 1.0.1, VisionAI Extractor now supports streaming responses, allowing you to process PDF pages as they become available without waiting for the entire document to complete processing.

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

# Initialize the extractor
vision_ai = VisionAIExtractor(
    api_key="your_openai_api_key",
    model_name="gpt-4o-mini"
)

# Synchronous extraction
response = vision_ai.extract_text_from_pdf(pdf_path="/path/to/your/document.pdf")
print(response[0].page_content)  # Print the content of the first page

# Asynchronous extraction with configurable batch size
async def extract_text():
    response = await vision_ai.extract_text_from_pdf_async(
        pdf_path="/path/to/your/document.pdf",
        batch_size=2  # Default is None, which processes all pages concurrently
    )
    print(response[0].page_content)

# Streaming extraction (available in v1.0.1) - process pages as they become available
async def extract_text_streaming():
    """Async function to process PDF pages as they become available."""
    async for document in vision_ai.extract_text_from_pdf_async(
        "/path/to/your/document.pdf",
        batch_size=2
    ):
        # Process each document as it becomes available
        print(f"Content preview: {document.page_content[:100]}...")  # First 100 chars

import asyncio
asyncio.run(extract_text())  # Run the async function to extract text
asyncio.run(extract_text_streaming())  # Run the streaming async function
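The batch_size behavior above can be pictured as simple page chunking: pages are grouped into batches, and each batch is processed concurrently. The helper below is an illustrative, library-independent sketch, not part of the package:

```python
def chunk_pages(pages, batch_size=None):
    """Group pages into batches; batch_size=None means one batch with all pages."""
    pages = list(pages)
    if batch_size is None:
        return [pages]
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

# With batch_size=2, a 5-page document is processed in three batches.
print(chunk_pages(range(1, 6), batch_size=2))  # [[1, 2], [3, 4], [5]]
print(chunk_pages(range(1, 6)))                # [[1, 2, 3, 4, 5]]
```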

Required Environment Variable:

  • OPENAI_API_KEY – Your OpenAI API key for Vision AI access

If you are facing Poppler issues while running VisionAI Extractor, ensure Poppler is installed:

Linux:

```bash
sudo apt-get update
sudo apt-get install poppler-utils
```

Windows:

Download Poppler from the oschwartz10612/poppler-windows releases on GitHub, extract it, and add the bin folder to your system PATH.

Example path: C:\poppler\bin

MacOS:

```bash
brew install poppler
```

5. AmazonBoto3Connector#

AmazonBoto3Connector uses boto3 for AWS services with support for both synchronous (single page) and asynchronous (multi-page) processing.

from elsai_ocr_extractors.amazon_boto3 import AmazonBoto3Connector

# Initialize the connector
client = AmazonBoto3Connector(
    access_key="your_aws_access_key",
    secret_key="your_aws_secret_key",
    session_token="your_session_token",
    region_name="us-east-1"
)

# S3 file processing - Synchronous
resp = client.sync_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# S3 file processing - Asynchronous
resp = client.async_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# Local file processing - Synchronous
resp = client.sync_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

# Local file processing - Asynchronous
resp = client.async_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)
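As the examples show, both processing methods accept either an s3:// URI or a local path as file_source; local files additionally need s3_bucket and s3_folder so they can be placed in S3 first. One way to picture that routing (a hypothetical sketch for illustration, not the connector's actual implementation):

```python
from urllib.parse import urlparse

def resolve_source(file_source, s3_bucket=None, s3_folder=None):
    """Return (bucket, key): parsed from an s3:// URI, or derived for a local file."""
    parsed = urlparse(file_source)
    if parsed.scheme == "s3":
        return parsed.netloc, parsed.path.lstrip("/")
    if s3_bucket is None or s3_folder is None:
        raise ValueError("Local files require s3_bucket and s3_folder")
    filename = file_source.rsplit("/", 1)[-1]
    return s3_bucket, f"{s3_folder}/{filename}"

print(resolve_source("s3://voice-agent-upload/general/FRF.pdf"))
# ('voice-agent-upload', 'general/FRF.pdf')
print(resolve_source("/path/to/your/document.pdf", "voice-agent-upload", "general"))
# ('voice-agent-upload', 'general/document.pdf')
```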

Required Environment Variables:

  • AWS_ACCESS_KEY_ID – Your AWS access key ID

  • AWS_SECRET_ACCESS_KEY – Your AWS secret access key

  • AWS_SESSION_TOKEN – Your AWS session token (for temporary credentials)

  • AWS_REGION – AWS region (e.g., us-east-1)

6. AmazonTextractor#

AmazonTextractor uses amazon-textract-textractor for AWS Textract integration with support for both synchronous and asynchronous processing.

from elsai_ocr_extractors.amazon_textractor import AmazonTextractor

# Initialize the extractor
client = AmazonTextractor(
    access_key="your_aws_access_key",
    secret_key="your_aws_secret_key",
    session_token="your_session_token",
    region_name="us-east-1"
)

# S3 file processing - Synchronous
resp = client.sync_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# S3 file processing - Asynchronous
resp = client.async_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# Local file processing - Synchronous
resp = client.sync_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

# Local file processing - Asynchronous
resp = client.async_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

Required Environment Variables:

  • AWS_ACCESS_KEY_ID – Your AWS access key ID

  • AWS_SECRET_ACCESS_KEY – Your AWS secret access key

  • AWS_SESSION_TOKEN – Your AWS session token (for temporary credentials)

  • AWS_REGION – AWS region (e.g., us-east-1)

7. Mistral OCR Extractor#

MistralOCR is a new addition to the package with comprehensive OCR capabilities, including basic text extraction, structured annotations, and document question answering.

import os
from pydantic import BaseModel, Field
from elsai_ocr_extractors.mistral_ocr import MistralOCR

# Optional: Define Pydantic models for annotations
class MyBBoxFormat(BaseModel):
    image_type: str = Field(..., description="Type of the image")
    summary: str = Field(..., description="Summary of image content")

class MyDocFormat(BaseModel):
    language: str = Field(..., description="Language of the document")
    chapter_titles: list[str] = Field(..., description="Chapter titles")
    urls: list[str] = Field(..., description="List of URLs in the document")

def test_mistral_ocr():
    print("🔧 Initializing MistralOCR...")
    ocr = MistralOCR(
        file_path="path/to/your/document.pdf",
        api_key="your_mistral_api_key"
    )

    # Basic OCR
    print("\n▶️ Running basic OCR...")
    ocr_response = ocr.extract()
    print("✅ OCR Response received.")
    print(str(ocr_response)[:500], "...\n")

    # OCR with annotations
    print("▶️ Running OCR with annotations...")
    annotated_response = ocr.extract(
        bbox_annotation_model=MyBBoxFormat,
        document_annotation_model=MyDocFormat
    )
    print("✅ Annotated OCR Response received.")
    print(str(annotated_response)[:500], "...\n")

    # Document QnA
    print("▶️ Asking question from document...")
    answer = ocr.ask_question("What is the last sentence in the document?")
    print("✅ QnA Response:")
    print(answer)

if __name__ == "__main__":
    try:
        test_mistral_ocr()
    except Exception as e:
        print("❌ Error during test:", e)

Required Environment Variable:

  • MISTRAL_API_KEY – Your Mistral AI API key