Elsai OCR Extractors
The Elsai OCR Extractors package offers seamless integration with OCR services for extracting text from PDF documents. It currently supports:
Azure Cognitive Service
Azure Document Intelligence
Llama Parse
OpenAI VisionAI
Amazon Textract
Prerequisites
Python >= 3.9
.env file with appropriate API keys and configuration variables
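How the package reads the .env file is not described here, so the following is only a minimal sketch, assuming a plain KEY=VALUE file. `load_env_file` is a hypothetical helper and not part of elsai-ocr-extractors; a dedicated library such as python-dotenv is a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE pairs from a file into os.environ."""
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the environment are left unchanged.
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Calling `load_env_file()` before constructing an extractor would make the keys visible through `os.environ`, under the assumption that the extractors read their configuration from there.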
Installation
To install the elsai-ocr-extractors package:
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-ocr-extractors/ elsai-ocr-extractors==0.1.0
Components
1. AzureCognitiveService
AzureCognitiveService is a class for extracting text from PDFs using Azure’s Cognitive Services. It supports both direct initialization and environment variable-based configuration.
from elsai_ocr_extractors.azure_cognitive_service import AzureCognitiveService

# Pass credentials directly, or set them as environment variables instead
azure_ocr = AzureCognitiveService(
    file_path="your-file-path-here",
    subscription_key="your-subscription-key-here",
    endpoint="your-endpoint-here",
)
extracted_text = azure_ocr.extract_text_from_pdf()
Required Environment Variables:
AZURE_SUBSCRIPTION_KEY – Your Azure OCR subscription key
AZURE_ENDPOINT – Your Azure OCR endpoint
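When relying on environment variables rather than constructor arguments, it can help to check that they are set before creating the extractor. The snippet below is a small sketch of such a check; `missing_env_vars` is a hypothetical helper, not part of the package, and the optional `environ` argument exists only to make the function easy to test.

```python
import os

# Variable names as listed in this section.
REQUIRED_VARS = ("AZURE_SUBSCRIPTION_KEY", "AZURE_ENDPOINT")


def missing_env_vars(names, environ=None):
    """Hypothetical helper: return the names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

For example, `missing_env_vars(REQUIRED_VARS)` returns an empty list when both variables are present, and the names of the missing ones otherwise.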
2. AzureDocumentIntelligence
AzureDocumentIntelligence is a class that utilizes Azure’s advanced Document Intelligence to extract structured text from documents like PDFs.
from elsai_ocr_extractors.azure_document_intelligence import AzureDocumentIntelligence

# Pass credentials directly, or set them as environment variables instead
azure_ocr = AzureDocumentIntelligence(
    file_path="your-file-path",
    vision_endpoint="your-vision-endpoint",
    vision_key="your-vision-key",
)
extracted_text = azure_ocr.extract_text()      # optionally pass pages="1,3"
extracted_tables = azure_ocr.extract_tables()  # optionally pass pages="1,3"
Required Environment Variables:
VISION_ENDPOINT – Endpoint for Azure Document Intelligence
VISION_KEY – Subscription key for Azure Document Intelligence
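The optional pages argument is shown above as a comma-separated string such as "1,3". If page numbers are held as integers, a helper along these lines could build that string. `format_pages` is hypothetical, and the assumption that page numbers start at 1 is not confirmed by this documentation.

```python
def format_pages(pages):
    """Hypothetical helper: turn page numbers into a string such as "1,3"."""
    unique_sorted = sorted(set(pages))
    # Assumption: pages are numbered from 1, so anything lower is rejected.
    if any(p < 1 for p in unique_sorted):
        raise ValueError("page numbers are assumed to start at 1")
    return ",".join(str(p) for p in unique_sorted)
```

The result could then be passed as `pages=format_pages([1, 3])` to `extract_text()` or `extract_tables()`.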
3. Llama Parse Extractor
LlamaParseExtractor is a wrapper class for interacting with the LlamaParse API to parse and load CSV data.
from elsai_ocr_extractors.llama_parse_extractor import LlamaParseExtractor
llama_parse_extractor = LlamaParseExtractor(api_key="llama_parse_api_key")
loaded_data = llama_parse_extractor.load_csv("path/to/your/file.csv")
4. VisionAI Extractor
VisionAIExtractor uses OpenAI’s Vision AI to extract text from PDFs by converting pages to images and processing them using a language model.
from elsai_ocr_extractors.visionai_extractor import VisionAIExtractor
extractor = VisionAIExtractor(api_key="openai_key", model_name="gpt-4o") # Or set API key in env
documents = extractor.extract_text_from_pdf(pdf_path="path_to_pdf.pdf")
Required Environment Variable:
OPENAI_API_KEY – Your OpenAI secret API key, used to authenticate and authorize API requests.
If you are facing Poppler issues while running VisionAI Extractor, ensure Poppler is installed:
Linux:
```bash
sudo apt-get update
sudo apt-get install poppler-utils
```
Windows:
Download from the oschwartz10612/poppler-windows repository on GitHub, extract the archive, and add its bin folder to your system PATH.
Example path: C:\poppler\bin
macOS:
```bash
brew install poppler
```
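To confirm that the installation worked, one option is to check whether a Poppler executable is visible on PATH. This sketch assumes that VisionAIExtractor converts pages through Poppler's command-line tools (as PDF-to-image libraries such as pdf2image do), which this documentation does not state explicitly. `poppler_available` is a hypothetical helper.

```python
import shutil


def poppler_available():
    """Return True if Poppler's pdftoppm executable is found on PATH."""
    return shutil.which("pdftoppm") is not None
```

If `poppler_available()` returns False after installing, the bin folder is probably not on PATH for the Python process.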
5. Amazon Textract
AwsTextractConnector interfaces with AWS Textract to extract text from PDFs. It supports multiple modes of operation including async document processing.
from elsai_ocr_extractors.awstextract import AwsTextractConnector

# Pass credentials directly, or set the environment variables instead
textract_connector = AwsTextractConnector(
    access_key="aws_access_key",
    secret_key="aws_secret_key",
    session_token="aws_session_token",
    region_name="aws_region",
)

# Extract text from a local PDF (s3_folder is optional)
documents = textract_connector.extract_text(
    file_path="path_to_pdf.pdf",
    s3_bucket="your_s3_bucket_name",
    s3_folder="your_s3_folder_name",
)

# Extract text from S3
documents = textract_connector.extract_text_from_s3(
    s3_url="s3://bucket_name/folder_name/file.pdf"
)

# Extract text with features
documents = textract_connector.extract_text_features_from_s3(
    s3_url="s3://bucket_name/folder_name/file.pdf",
    features=["TABLES", "FORMS", "TEXT"],
)

# Asynchronous processing
documents = textract_connector.async_process_document(
    s3_url="s3://bucket_name/folder_name/file.pdf",
    feature_list=["tables", "forms"],
)
Required Environment Variables:
AWS_ACCESS_KEY_ID – Your AWS access key ID, used to authenticate API requests.
AWS_SECRET_ACCESS_KEY – Your AWS secret key, paired with the access key for secure authentication.
AWS_SESSION_TOKEN – Temporary session token for AWS STS or other temporary credentials.
AWS_REGION – The AWS region where your resources (e.g., S3 bucket) are located.
S3_BUCKET – The name of your S3 bucket used for file uploads/downloads (required only for extract_text()).
S3_FOLDER – The folder path inside the S3 bucket for organizing files (required only for extract_text()).
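The S3-based methods take URLs of the form s3://bucket_name/folder_name/file.pdf. The sketch below shows one way to validate such a URL and split it into bucket and object key before passing it on. `split_s3_url` is a hypothetical helper and does not reflect how AwsTextractConnector parses URLs internally.

```python
from urllib.parse import urlparse


def split_s3_url(s3_url):
    """Hypothetical helper: split "s3://bucket/key" into (bucket, key)."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")
```

Rejecting malformed URLs early gives a clearer error than a failed Textract request.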