Elsai Speech Services#
The Elsai Speech Services package provides audio processing for Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech conversion, using Azure OpenAI Whisper and TTS models.
Prerequisites#
- Python >= 3.9 
- .env file with Azure OpenAI credentials 
- Required Python packages: openai, pydub, numpy, python-dotenv 
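The Azure OpenAI credentials can live in a `.env` file at the project root. A minimal sketch using the variable names documented later on this page (all values below are placeholders):

```
AZURE_OPENAI_API_VERSION=2024-06-01
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT_ID=whisper-deployment
```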
Installation#
To install the elsai-stt package:
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-stt/ elsai-stt==0.1.0
Components#
1. AzureOpenAIWhisper (Speech-to-Text)#
AzureOpenAIWhisper is a class used to transcribe audio using Azure’s hosted version of OpenAI Whisper. You can provide credentials directly or through environment variables.
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper
whisper = AzureOpenAIWhisper(
    api_version="your_api_version",
    endpoint="azure_whisper_endpoint",
    api_key="api_key",
    deployment_id="deployment_id"
)  # Or set via environment variables
# Path to the audio file
test_file = "harvard.wav"
# Transcribe audio
result = whisper.transcribe_audio(file_path=test_file)
Required Environment Variables for STT:
- AZURE_OPENAI_API_VERSION – Version of the Azure OpenAI API
- AZURE_OPENAI_ENDPOINT – Endpoint URL for the Azure Whisper deployment
- AZURE_OPENAI_API_KEY – API key for authenticating requests
- AZURE_OPENAI_DEPLOYMENT_ID – Deployment ID for the Whisper model
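When relying on environment variables, it can help to fail early if any of the four settings above is missing. A small sketch (the helper name `missing_stt_settings` is ours, not part of the package; `load_dotenv()` from the python-dotenv prerequisite can populate `os.environ` first):

```python
import os

# The four settings listed above, as documented for STT
REQUIRED_STT_VARS = [
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_ID",
]

def missing_stt_settings(env=None):
    """Return the required Whisper settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_STT_VARS if not env.get(name)]
```

Calling `missing_stt_settings()` before constructing `AzureOpenAIWhisper()` with no arguments turns a vague authentication failure into an explicit list of missing variables.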
2. AzureOpenAITTS (Text-to-Speech)#
AzureOpenAITTS provides text-to-speech conversion using Azure OpenAI’s TTS models.
from elsai_stt.tts.azure_openai import AzureOpenAITTS
tts = AzureOpenAITTS(
    api_key="your_api_key",
    api_version="your_api_version",
    endpoint="your_endpoint",
    deployment_id="tts_deployment_id"
)
# Convert text to speech
audio_path = tts.text_to_speech(
    text="Hello, this is a test of text-to-speech conversion.",
    voice="alloy",
    format="mp3",
    speed=1.0,
    save_to="output.mp3"
)
TTS Configuration Options:
- Voices: alloy, echo, fable, onyx, nova, shimmer
- Formats: mp3, opus, aac, flac, wav, pcm
- Speed: 0.25 to 4.0 (1.0 is normal speed) 
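Validating these options locally avoids a round trip to Azure for a bad request. A sketch under the documented values above (`validate_tts_options` is a hypothetical helper, not part of the package):

```python
# Allowed values per the TTS configuration options above
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def validate_tts_options(voice, fmt, speed):
    """Raise ValueError if any option falls outside the documented ranges."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {sorted(VOICES)}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}; choose one of {sorted(FORMATS)}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0 (1.0 is normal)")
```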
3. AzureOpenAISpeechToSpeech (Complete Pipeline)#
AzureOpenAISpeechToSpeech combines STT and TTS for complete speech-to-speech conversion, supporting both shared and separate Azure resources.
Shared Configuration (Same Azure Resource):
from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech
speech_service = AzureOpenAISpeechToSpeech(
    api_key="shared_api_key",
    api_version="2024-06-01",
    endpoint="https://your-resource.openai.azure.com/",
    whisper_deployment_id="whisper-deployment",
    tts_deployment_id="tts-deployment"
)
Separate Configuration (Different Azure Resources):
speech_service = AzureOpenAISpeechToSpeech(
    whisper_api_key="whisper_api_key",
    whisper_api_version="2024-06-01",
    whisper_endpoint="https://whisper-resource.openai.azure.com/",
    whisper_deployment_id="whisper-deployment",
    tts_api_key="tts_api_key",
    tts_api_version="2024-06-01",
    tts_endpoint="https://tts-resource.openai.azure.com/",
    tts_deployment_id="tts-deployment"
)
Usage Examples:
# Speech-to-Text only
transcription = speech_service.transcribe_audio(file_path="input.wav")
print(f"Transcribed: {transcription}")
# Text-to-Speech only
audio_file = speech_service.synthesize_speech(
    text="Hello world",
    voice="nova",
    output_format="mp3"
)
# Complete Speech-to-Speech conversion
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    output_voice="alloy",
    output_format="mp3",
    output_path="output.mp3",
    return_transcription=True
)
# result is a tuple: (audio_path, original_transcription, processed_text)
audio_path, transcription, processed_text = result
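Because `process_speech_to_speech` can return a file path, raw bytes, a tuple, or None depending on the flags, a small dispatch helper keeps calling code tidy (`handle_result` is our own sketch, not part of the package):

```python
def handle_result(result):
    """Normalize the Union[str, bytes, tuple, None] return value."""
    if result is None:
        return "no output requested"
    if isinstance(result, tuple):  # return_transcription=True
        audio_path, transcription, processed_text = result
        return f"saved {audio_path}; heard: {transcription}"
    if isinstance(result, bytes):  # buffer output
        return f"{len(result)} bytes of audio"
    return f"saved {result}"  # str: output file path
```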
Advanced Speech-to-Speech with Text Processing:
def text_processor(text):
    # Custom text processing (e.g., translation, summarization)
    return f"Processed: {text.upper()}"
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    text_processor=text_processor,
    output_voice="echo",
    output_format="wav",
    return_transcription=True
)
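Any callable that takes a string and returns a string works as a `text_processor`. A minimal concrete sketch (a hypothetical helper that trims long transcriptions before synthesis, to bound TTS cost):

```python
def truncate_processor(text, limit=200):
    """Example processor: trim long transcriptions at a word boundary."""
    text = text.strip()
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(" ", 1)[0] + "..."
```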
Processing Audio Buffers:
import numpy as np
# Example: processing real-time audio buffers
audio_buffers = [np.random.randn(1024) for _ in range(10)]
result = speech_service.process_speech_to_speech(
    input_buffer=audio_buffers,
    input_format="wav",
    input_sample_rate=16000,
    output_voice="shimmer",
    output_format="mp3"
)
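The exact dtype and value range expected for `input_buffer` chunks is not specified on this page. If a capture pipeline yields float samples in [-1, 1], a common convention (an assumption here, not a documented requirement) is to clip and scale them to 16-bit PCM before handing them over:

```python
import numpy as np

def to_int16_pcm(chunks):
    """Concatenate float chunks, clip to [-1, 1], scale to 16-bit PCM."""
    samples = np.concatenate(chunks)
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```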
Method Reference#
AzureOpenAISpeechToSpeech Methods#
transcribe_audio(file_path=None, buffer=None, output_format="webm", sample_rate=24000)
Transcribes audio from file or buffer to text.
- Args:
  - file_path (str, optional): Path to audio file
  - buffer (List[np.ndarray], optional): Audio buffer chunks
  - output_format (str): Format for buffer processing ("webm", "wav", "mp3")
  - sample_rate (int): Sample rate for buffer processing
- Returns:
  - str: Transcribed text
synthesize_speech(text, voice="alloy", response_format="mp3", speed=1.0, output_path=None, return_buffer=False)
Converts text to speech.
- Args:
  - text (str): Text to convert
  - voice (str): Voice selection
  - response_format (str): Audio format
  - speed (float): Speech speed (0.25 to 4.0)
  - output_path (str, optional): Save location
  - return_buffer (bool): Return bytes instead of a file path
- Returns:
  - Union[str, bytes, None]: File path, audio bytes, or None
process_speech_to_speech(…)
Complete speech-to-speech pipeline with extensive configuration options.
- Key Args:
  - input_file_path / input_buffer: Audio input
  - text_processor (callable, optional): Function to process transcribed text
  - return_transcription (bool): Include the transcription in the return value
  - output_voice, output_format, output_speed: TTS configuration
- Returns:
  - Union[str, bytes, tuple, None]: Based on return flags