Elsai Speech Services#
The Elsai Speech Services package provides comprehensive audio processing capabilities including Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech conversion using Azure OpenAI Whisper and TTS models.
Prerequisites#
Python >= 3.9
.env file with Azure OpenAI credentials
Required Python packages: openai, pydub, numpy, python-dotenv
Installation#
To install the elsai-stt package:
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-stt/ elsai-stt==0.1.0
Components#
1. AzureOpenAIWhisper (Speech-to-Text)#
AzureOpenAIWhisper is a class used to transcribe audio using Azure’s hosted version of OpenAI Whisper. You can provide credentials directly or through environment variables.
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper
whisper = AzureOpenAIWhisper(
    api_version="your_api_version",
    endpoint="azure_whisper_endpoint",
    api_key="api_key",
    deployment_id="deployment_id"
)  # Or set via environment variables
# Path to the audio file
test_file = "harvard.wav"
# Transcribe audio
result = whisper.transcribe_audio(file_path=test_file)
Required Environment Variables for STT:
AZURE_OPENAI_API_VERSION – Version of the Azure OpenAI API
AZURE_OPENAI_ENDPOINT – Endpoint URL for the Azure Whisper deployment
AZURE_OPENAI_API_KEY – API key for authenticating requests
AZURE_OPENAI_DEPLOYMENT_ID – Deployment ID for the Whisper model
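If you prefer to configure via environment variables, here is a minimal sketch, assuming (per the comment in the example above) that the constructor falls back to these variables when no arguments are passed:
from dotenv import load_dotenv
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper

load_dotenv()  # reads the AZURE_OPENAI_* values from a local .env file

# No arguments: credentials are picked up from the environment
whisper = AzureOpenAIWhisper()
result = whisper.transcribe_audio(file_path="harvard.wav")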
2. AzureOpenAITTS (Text-to-Speech)#
AzureOpenAITTS provides text-to-speech conversion using Azure OpenAI’s TTS models.
from elsai_stt.tts.azure_openai import AzureOpenAITTS
tts = AzureOpenAITTS(
    api_key="your_api_key",
    api_version="your_api_version",
    endpoint="your_endpoint",
    deployment_id="tts_deployment_id"
)
# Convert text to speech
audio_path = tts.text_to_speech(
text="Hello, this is a test of text-to-speech conversion.",
voice="alloy",
format="mp3",
speed=1.0,
save_to="output.mp3"
)
TTS Configuration Options:
Voices: alloy, echo, fable, onyx, nova, shimmer
Formats: mp3, opus, aac, flac, wav, pcm
Speed: 0.25 to 4.0 (1.0 is normal speed)
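For example, to audition the voices side by side, the text_to_speech call shown above can be looped over each option (the output file names here are illustrative):
# Generate the same sentence in every available voice for comparison
for voice in ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]:
    tts.text_to_speech(
        text="This is a voice comparison sample.",
        voice=voice,
        format="mp3",
        speed=1.0,
        save_to=f"sample_{voice}.mp3"
    )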
3. AzureOpenAISpeechToSpeech (Complete Pipeline)#
AzureOpenAISpeechToSpeech combines STT and TTS for complete speech-to-speech conversion, supporting both shared and separate Azure resources.
Shared Configuration (Same Azure Resource):
from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech
speech_service = AzureOpenAISpeechToSpeech(
api_key="shared_api_key",
api_version="2024-06-01",
endpoint="https://your-resource.openai.azure.com/",
whisper_deployment_id="whisper-deployment",
tts_deployment_id="tts-deployment"
)
Separate Configuration (Different Azure Resources):
speech_service = AzureOpenAISpeechToSpeech(
    whisper_api_key="whisper_api_key",
    whisper_api_version="2024-06-01",
    whisper_endpoint="https://whisper-resource.openai.azure.com/",
    whisper_deployment_id="whisper-deployment",
    tts_api_key="tts_api_key",
    tts_api_version="2024-06-01",
    tts_endpoint="https://tts-resource.openai.azure.com/",
    tts_deployment_id="tts-deployment"
)
Usage Examples:
# Speech-to-Text only
transcription = speech_service.transcribe_audio(file_path="input.wav")
print(f"Transcribed: {transcription}")
# Text-to-Speech only
audio_file = speech_service.synthesize_speech(
text="Hello world",
voice="nova",
output_format="mp3"
)
# Complete Speech-to-Speech conversion
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    output_voice="alloy",
    output_format="mp3",
    output_path="output.mp3",
    return_transcription=True
)
# result is a tuple: (audio_path, original_transcription, processed_text)
audio_path, transcription, processed_text = result
Advanced Speech-to-Speech with Text Processing:
def text_processor(text):
    # Custom text processing (e.g., translation, summarization)
    return f"Processed: {text.upper()}"
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    text_processor=text_processor,
    output_voice="echo",
    output_format="wav",
    return_transcription=True
)
Processing Audio Buffers:
import numpy as np
# Example: placeholder buffers standing in for real-time audio chunks
# (random noise will not transcribe meaningfully; use captured audio in practice)
audio_buffers = [np.random.randn(1024) for _ in range(10)]
result = speech_service.process_speech_to_speech(
    input_buffer=audio_buffers,
    input_format="wav",
    input_sample_rate=16000,
    output_voice="shimmer",
    output_format="mp3"
)
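To feed real audio instead of placeholder noise, one option is to record from the microphone and chunk the samples. A sketch using the third-party sounddevice library (not part of elsai-stt; the chunk size and dtype are assumptions):
import numpy as np
import sounddevice as sd

sample_rate = 16000
duration = 5  # seconds of audio to capture

# Record mono float32 audio and block until the recording finishes
recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
sd.wait()

# Split the recording into fixed-size chunks for input_buffer
chunk_size = 1024
audio_buffers = [recording[i:i + chunk_size].flatten()
                 for i in range(0, len(recording), chunk_size)]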
Method Reference#
AzureOpenAISpeechToSpeech Methods#
transcribe_audio(file_path=None, buffer=None, output_format="webm", sample_rate=24000)
Transcribes audio from file or buffer to text.
- Args:
  - file_path (str, optional): Path to audio file
  - buffer (List[np.ndarray], optional): Audio buffer chunks
  - output_format (str): Format for buffer processing ("webm", "wav", "mp3")
  - sample_rate (int): Sample rate for buffer processing
- Returns:
  - str: Transcribed text
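For instance, transcribing from in-memory chunks rather than a file (reusing an audio_buffers list like the one from the buffer example above):
text = speech_service.transcribe_audio(
    buffer=audio_buffers,   # List[np.ndarray] chunks
    output_format="wav",    # container format used for buffer processing
    sample_rate=16000
)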
synthesize_speech(text, voice="alloy", response_format="mp3", speed=1.0, output_path=None, return_buffer=False)
Converts text to speech.
- Args:
  - text (str): Text to convert
  - voice (str): Voice selection
  - response_format (str): Audio format
  - speed (float): Speech speed (0.25-4.0)
  - output_path (str, optional): Save location
  - return_buffer (bool): Return bytes instead of file path
- Returns:
  - Union[str, bytes, None]: File path, audio bytes, or None
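For example, requesting raw bytes instead of a file, which can then be streamed or written manually:
audio_bytes = speech_service.synthesize_speech(
    text="Streaming-friendly output.",
    voice="nova",
    response_format="mp3",
    return_buffer=True   # return bytes rather than writing a file
)
with open("manual_output.mp3", "wb") as f:
    f.write(audio_bytes)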
process_speech_to_speech(…)
Complete speech-to-speech pipeline with extensive configuration options.
- Key Args:
  - input_file_path / input_buffer: Audio input
  - text_processor (callable, optional): Function to process transcribed text
  - return_transcription (bool): Include transcription in return
  - output_voice, output_format, output_speed: TTS configuration
- Returns:
  - Union[str, bytes, tuple, None]: Based on return flags
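Because the return type depends on the flags passed, callers may want to branch on the shape of the result, for example:
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    output_path="output.mp3",
    return_transcription=True
)
if isinstance(result, tuple):
    # (audio_path, original_transcription, processed_text), as documented above
    audio_path, transcription, processed_text = result
else:
    audio_path = result  # without return_transcription, only the path is returned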