Elsai OCR Extractors v1.0.0
The Elsai OCR Extractors package integrates with OCR services to extract text from PDF documents. Version 1.0.0 adds enhanced capabilities and covers the following services:
Azure Cognitive Service
Azure Document Intelligence
Llama Parse
VisionAI
AmazonBoto3Connector (Uses boto3)
AmazonTextractor (Uses amazon-textract-textractor)
Mistral OCR
Prerequisites
Python >= 3.9
.env file with appropriate API keys and configuration variables
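For reference, a .env file is a plain-text list of KEY=value lines. The sketch below is a simplified loader built only on the standard library. It is not part of this package, and the helper names parse_env_text and load_env_file are hypothetical. Whether the extractors read .env on their own is not stated here, and the parser handles only simple lines (no multi-line values or variable expansion).

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop optional surrounding quotes: KEY="value" or KEY='value'
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path: str = ".env") -> None:
    """Copy values from the file into os.environ without overriding existing ones."""
    text = Path(path).read_text(encoding="utf-8")
    for key, value in parse_env_text(text).items():
        os.environ.setdefault(key, value)


# Placeholder values only; replace them with real credentials in your own file.
sample = """
# Azure OCR settings
AZURE_SUBSCRIPTION_KEY="your-subscription-key-here"
AZURE_ENDPOINT=your-endpoint-here
"""
print(parse_env_text(sample))
```

Calling load_env_file() once at program start, before constructing any extractor, would make the values available through os.environ.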
Installation
To install the elsai-ocr-extractors package:
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-ocr-extractors/ elsai-ocr-extractors==1.0.0
Components
1. AzureCognitiveService
AzureCognitiveService is a class for extracting text from PDFs using Azure’s Cognitive Services. It supports both direct initialization and environment variable-based configuration.
from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

azure_ocr = AzureCognitiveService(
    file_path="your-file-path-here",
    subscription_key="your-subscription-key-here",
    endpoint="your-endpoint-here"
)  # subscription_key and endpoint can instead be set as environment variables
extracted_text = azure_ocr.extract_text_from_pdf()
Required Environment Variables:
AZURE_SUBSCRIPTION_KEY – Your Azure OCR subscription key
AZURE_ENDPOINT – Your Azure OCR endpoint
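The class accepts its key and endpoint either as constructor arguments or from the environment. How it resolves the two internally is not documented here. The sketch below shows one common precedence (explicit argument first, environment variable second) using a hypothetical helper, resolve_setting, that is not part of the package.

```python
import os
from typing import Optional


def resolve_setting(explicit: Optional[str], env_name: str) -> str:
    """Return the explicit value if given, otherwise the environment variable."""
    value = explicit if explicit else os.environ.get(env_name)
    if not value:
        raise ValueError(f"Pass a value explicitly or set {env_name}")
    return value


os.environ["AZURE_ENDPOINT"] = "your-endpoint-here"  # placeholder for the demo

# No explicit value: falls back to the environment variable.
print(resolve_setting(None, "AZURE_ENDPOINT"))
# Explicit value: takes precedence over the environment variable.
print(resolve_setting("explicit-endpoint", "AZURE_ENDPOINT"))
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the remote service.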
2. AzureDocumentIntelligence
AzureDocumentIntelligence is a class that uses Azure’s Document Intelligence service to extract structured text and tables from documents such as PDFs. Its extraction methods now also accept additional keyword arguments.
from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

azure_ocr = AzureDocumentIntelligence(
    file_path="your-file-path",
    vision_endpoint="your-vision-endpoint",
    vision_key="your-vision-key"
)  # vision_endpoint and vision_key can instead be set as environment variables

# Both methods optionally take pages="1,3" to limit extraction to specific pages,
# and accept additional keyword arguments (**kwargs).
extracted_text = azure_ocr.extract_text()
extracted_tables = azure_ocr.extract_tables()
Required Environment Variables:
VISION_ENDPOINT – Endpoint for Azure Document Intelligence
VISION_KEY – Subscription key for Azure Document Intelligence
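The pages argument shown above is a comma-separated string such as "1,3". When page numbers come from a list, a small helper can build that string. This is a sketch only: pages_argument is a hypothetical name, and it assumes page numbers start at 1 and that only the comma-separated form is needed.

```python
from typing import Iterable


def pages_argument(page_numbers: Iterable[int]) -> str:
    """Join page numbers into a comma-separated string such as "1,3"."""
    unique_sorted = sorted(set(page_numbers))
    if any(page < 1 for page in unique_sorted):
        raise ValueError("Page numbers are assumed to start at 1")
    return ",".join(str(page) for page in unique_sorted)


print(pages_argument([3, 1, 3]))  # prints 1,3
```

The result could then be passed along the lines of azure_ocr.extract_text(pages=pages_argument([1, 3])).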
3. Llama Parse Extractor
LlamaParseExtractor is a wrapper class for interacting with the LlamaParse API to parse and load CSV data.
from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor
llama_parse_extractor = LlamaParseExtractor(api_key="llama_parse_api_key")
loaded_data = llama_parse_extractor.load_csv("path/to/your/file.csv")
4. VisionAI Extractor (Enhanced)
VisionAIExtractor has been enhanced with new capabilities for v1.0.0.
New Features:
- Support for synchronous and asynchronous extraction modes
- Configurable batch size for large documents
- Dynamic prompt updating before extraction
import asyncio

from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor

# Initialize the extractor
vision_ai = VisionAIExtractor(
    api_key="your_openai_api_key",
    model_name="gpt-4o-mini"
)

# Synchronous extraction
response = vision_ai.extract_text_from_pdf(pdf_path="/path/to/your/document.pdf")
print(response[0].page_content)  # Print the content of the first page

# Asynchronous extraction with configurable batch size
async def extract_text():
    response = await vision_ai.extract_text_from_pdf_async(
        pdf_path="/path/to/your/document.pdf",
        batch_size=2  # Default is None, which processes all pages concurrently
    )
    print(response[0].page_content)

asyncio.run(extract_text())  # Run the async function to extract text
Required Environment Variable:
OPENAI_API_KEY – Your OpenAI API key for Vision AI access
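To illustrate what a batch size means for asynchronous extraction, the sketch below runs a per-page coroutine in fixed-size batches with asyncio.gather. It is a generic pattern, not VisionAIExtractor's actual implementation. The names run_in_batches and fake_extract_page are hypothetical, and the fake worker stands in for a real OCR request.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_in_batches(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: Optional[int] = None,
) -> list[R]:
    """Await worker(item) for every item, at most batch_size at a time."""
    if batch_size is None:
        batch_size = max(len(items), 1)  # one batch: everything runs concurrently
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: list[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # gather returns results in input order, so page order is preserved
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_extract_page(page_number: int) -> str:
    """Stand-in for a per-page OCR request."""
    await asyncio.sleep(0)
    return f"text of page {page_number}"


print(asyncio.run(run_in_batches([1, 2, 3], fake_extract_page, batch_size=2)))
```

A smaller batch size limits how many requests are in flight at once, which can help with rate limits on large documents at the cost of total run time.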
5. AmazonBoto3Connector
AmazonBoto3Connector uses boto3 for AWS services with support for both synchronous (single page) and asynchronous (multi-page) processing.
from elsai_ocr_extractors.amazon_boto3 import AmazonBoto3Connector

# Initialize the connector
client = AmazonBoto3Connector(
    access_key="your_aws_access_key",
    secret_key="your_aws_secret_key",
    session_token="your_session_token",
    region_name="us-east-1"
)

# S3 file processing - Synchronous
resp = client.sync_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# S3 file processing - Asynchronous
resp = client.async_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# Local file processing - Synchronous
resp = client.sync_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

# Local file processing - Asynchronous
resp = client.async_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)
Required Environment Variables:
AWS_ACCESS_KEY_ID – Your AWS access key ID
AWS_SECRET_ACCESS_KEY – Your AWS secret access key
AWS_SESSION_TOKEN – Your AWS session token (for temporary credentials)
AWS_REGION – AWS region (e.g., us-east-1)
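Because synchronous processing is described as single-page and asynchronous processing as multi-page, a caller may want to choose between them by page count. The sketch below shows one way to do that. process_by_page_count and StubClient are hypothetical, the page count is assumed to be known by the caller, and the stub only mimics the two method names so the example runs without AWS access.

```python
from typing import Any


def process_by_page_count(client: Any, file_source: str, page_count: int, **kwargs: Any) -> Any:
    """Use the synchronous call for one page and the asynchronous call otherwise."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count == 1:
        method = client.sync_process_document
    else:
        method = client.async_process_document
    return method(file_source=file_source, **kwargs)


class StubClient:
    """Stand-in exposing the two method names; it only reports which one ran."""

    def sync_process_document(self, **kwargs: Any) -> str:
        return "sync"

    def async_process_document(self, **kwargs: Any) -> str:
        return "async"


print(process_by_page_count(StubClient(), "/path/to/your/document.pdf", page_count=1))
print(process_by_page_count(StubClient(), "/path/to/your/document.pdf", page_count=12))
```

With a real connector, the remaining arguments such as feature_list, s3_bucket, and s3_folder would be passed through as keyword arguments.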
6. AmazonTextractor
AmazonTextractor uses amazon-textract-textractor for AWS Textract integration with support for both synchronous and asynchronous processing.
from elsai_ocr_extractors.amazon_textractor import AmazonTextractor

# Initialize the extractor
client = AmazonTextractor(
    access_key="your_aws_access_key",
    secret_key="your_aws_secret_key",
    session_token="your_session_token",
    region_name="us-east-1"
)

# S3 file processing - Synchronous
resp = client.sync_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# S3 file processing - Asynchronous
resp = client.async_process_document(
    file_source="s3://voice-agent-upload/general/FRF.pdf",
    feature_list=["text", "tables", "forms", "layout"]
)
print(resp)

# Local file processing - Synchronous
resp = client.sync_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)

# Local file processing - Asynchronous
resp = client.async_process_document(
    file_source="/path/to/your/document.pdf",
    feature_list=["text", "tables", "forms", "layout"],
    s3_bucket="voice-agent-upload",
    s3_folder="general"
)
print(resp)
Required Environment Variables:
AWS_ACCESS_KEY_ID – Your AWS access key ID
AWS_SECRET_ACCESS_KEY – Your AWS secret access key
AWS_SESSION_TOKEN – Your AWS session token (for temporary credentials)
AWS_REGION – AWS region (e.g., us-east-1)
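In the examples above, S3 sources are passed as s3:// URIs with no further arguments, while local files are accompanied by s3_bucket and s3_folder. The sketch below builds the keyword arguments for either case. build_process_kwargs is a hypothetical helper, and treating s3_bucket and s3_folder as mandatory for local files is an assumption drawn from the examples, not a documented rule.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def build_process_kwargs(
    file_source: str,
    feature_list: list[str],
    s3_bucket: Optional[str] = None,
    s3_folder: Optional[str] = None,
) -> dict[str, Any]:
    """Build keyword arguments for an S3 URI or a local file path."""
    kwargs: dict[str, Any] = {"file_source": file_source, "feature_list": feature_list}
    if urlparse(file_source).scheme == "s3":
        return kwargs  # already in S3: no upload location is added
    if not s3_bucket or not s3_folder:
        # Assumption: local files need an upload location, as in the examples above.
        raise ValueError("Local files are assumed to need s3_bucket and s3_folder")
    kwargs.update(s3_bucket=s3_bucket, s3_folder=s3_folder)
    return kwargs


print(build_process_kwargs("s3://voice-agent-upload/general/FRF.pdf", ["text", "tables"]))
print(
    build_process_kwargs(
        "/path/to/your/document.pdf",
        ["text", "tables"],
        s3_bucket="voice-agent-upload",
        s3_folder="general",
    )
)
```

The resulting dictionary could be unpacked into either call, for example client.sync_process_document(**kwargs).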
7. Mistral OCR Extractor
MistralOCR is a new addition to the package. It provides basic text extraction, annotated extraction driven by Pydantic models, and question answering over a document.
from pydantic import BaseModel, Field

from elsai_ocr_extractors.mistral_ocr import MistralOCR

# Optional: Define Pydantic models for annotations
class MyBBoxFormat(BaseModel):
    image_type: str = Field(..., description="Type of the image")
    summary: str = Field(..., description="Summary of image content")

class MyDocFormat(BaseModel):
    language: str = Field(..., description="Language of the document")
    chapter_titles: list[str] = Field(..., description="Chapter titles")
    urls: list[str] = Field(..., description="List of URLs in the document")

def test_mistral_ocr():
    print("🔧 Initializing MistralOCR...")
    ocr = MistralOCR(
        file_path="path/to/your/document.pdf",
        api_key="your_mistral_api_key"
    )

    # Basic OCR
    print("\n▶️ Running basic OCR...")
    ocr_response = ocr.extract()
    print("✅ OCR Response received.")
    print(str(ocr_response)[:500], "...\n")  # Preview the first 500 characters

    # OCR with annotations
    print("▶️ Running OCR with annotations...")
    annotated_response = ocr.extract(
        bbox_annotation_model=MyBBoxFormat,
        document_annotation_model=MyDocFormat
    )
    print("✅ Annotated OCR Response received.")
    print(str(annotated_response)[:500], "...\n")

    # Document QnA
    print("▶️ Asking question from document...")
    answer = ocr.ask_question("What is the last sentence in the document?")
    print("✅ QnA Response:")
    print(answer)

if __name__ == "__main__":
    try:
        test_mistral_ocr()
    except Exception as e:
        print("❌ Error during test:", e)
Required Environment Variable:
MISTRAL_API_KEY – Your Mistral AI API key