Elsai Speech Services#
The Elsai Speech Services package provides comprehensive audio processing capabilities including Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech conversion using Azure OpenAI Whisper and TTS models.
Prerequisites#
Python >= 3.9
.env file with Azure OpenAI credentials
Required Python packages: openai, pydub, numpy, python-dotenv
Installation#
To install the elsai-stt package:
pip install --index-url https://elsai-core-package.optisolbusiness.com/root/elsai-stt/ elsai-stt==0.1.0
Components#
1. AzureOpenAIWhisper (Speech-to-Text)#
AzureOpenAIWhisper is a class used to transcribe audio using Azure’s hosted version of OpenAI Whisper. You can provide credentials directly or through environment variables.
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper
whisper = AzureOpenAIWhisper(
    api_version="your_api_version",
    endpoint="azure_whisper_endpoint",
    api_key="api_key",
    deployment_id="deployment_id"
)  # Or set via environment variables
# Path to the audio file
test_file = "harvard.wav"
# Transcribe audio
result = whisper.transcribe_audio(file_path=test_file)
Required Environment Variables for STT:
AZURE_OPENAI_API_VERSION – Version of the Azure OpenAI API
AZURE_OPENAI_ENDPOINT – Endpoint URL for the Azure Whisper deployment
AZURE_OPENAI_API_KEY – API key for authenticating requests
AZURE_OPENAI_DEPLOYMENT_ID – Deployment ID for the Whisper model
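If you prefer to configure via environment variables, here is a minimal sketch, assuming (per the comment in the example above) that the constructor falls back to these variables when no arguments are passed:
from dotenv import load_dotenv
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper

load_dotenv()  # reads the AZURE_OPENAI_* values from a local .env file

# No arguments: credentials are picked up from the environment
whisper = AzureOpenAIWhisper()
result = whisper.transcribe_audio(file_path="harvard.wav")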
2. AzureOpenAITTS (Text-to-Speech)#
AzureOpenAITTS provides text-to-speech conversion using Azure OpenAI’s TTS models.
from elsai_stt.tts.azure_openai import AzureOpenAITTS
tts = AzureOpenAITTS(
    api_key="your_api_key",
    api_version="your_api_version",
    endpoint="your_endpoint",
    deployment_id="tts_deployment_id"
)
# Convert text to speech
audio_path = tts.text_to_speech(
text="Hello, this is a test of text-to-speech conversion.",
voice="alloy",
format="mp3",
speed=1.0,
save_to="output.mp3"
)
TTS Configuration Options:
Voices: alloy, echo, fable, onyx, nova, shimmer
Formats: mp3, opus, aac, flac, wav, pcm
Speed: 0.25 to 4.0 (1.0 is normal speed)
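For example, to audition the voices side by side, the text_to_speech call shown above can be looped over each option (the output file names here are illustrative):
# Generate the same sentence in every available voice for comparison
for voice in ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]:
    tts.text_to_speech(
        text="This is a voice comparison sample.",
        voice=voice,
        format="mp3",
        speed=1.0,
        save_to=f"sample_{voice}.mp3"
    )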
3. AzureOpenAISpeechToSpeech (Complete Pipeline)#
AzureOpenAISpeechToSpeech combines STT and TTS for complete speech-to-speech conversion, supporting both shared and separate Azure resources.
Shared Configuration (Same Azure Resource):
from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech
speech_service = AzureOpenAISpeechToSpeech(
api_key="shared_api_key",
api_version="2024-06-01",
endpoint="https://your-resource.openai.azure.com/",
whisper_deployment_id="whisper-deployment",
tts_deployment_id="tts-deployment"
)
Separate Configuration (Different Azure Resources):
speech_service = AzureOpenAISpeechToSpeech(
    whisper_api_key="whisper_api_key",
    whisper_api_version="2024-06-01",
    whisper_endpoint="https://whisper-resource.openai.azure.com/",
    whisper_deployment_id="whisper-deployment",
    tts_api_key="tts_api_key",
    tts_api_version="2024-06-01",
    tts_endpoint="https://tts-resource.openai.azure.com/",
    tts_deployment_id="tts-deployment"
)
Usage Examples:
# Speech-to-Text only
transcription = speech_service.transcribe_audio(file_path="input.wav")
print(f"Transcribed: {transcription}")
# Text-to-Speech only
audio_file = speech_service.synthesize_speech(
text="Hello world",
voice="nova",
output_format="mp3"
)
# Complete Speech-to-Speech conversion
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    output_voice="alloy",
    output_format="mp3",
    output_path="output.mp3",
    return_transcription=True
)
# result is a tuple: (audio_path, original_transcription, processed_text)
audio_path, transcription, processed_text = result
Advanced Speech-to-Speech with Text Processing:
def text_processor(text):
    # Custom text processing (e.g., translation, summarization)
    return f"Processed: {text.upper()}"
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    text_processor=text_processor,
    output_voice="echo",
    output_format="wav",
    return_transcription=True
)
Processing Audio Buffers:
import numpy as np
# Example: placeholder buffers standing in for real-time audio chunks
# (random noise will not transcribe meaningfully; use captured audio in practice)
audio_buffers = [np.random.randn(1024) for _ in range(10)]
result = speech_service.process_speech_to_speech(
    input_buffer=audio_buffers,
    input_format="wav",
    input_sample_rate=16000,
    output_voice="shimmer",
    output_format="mp3"
)
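To feed real audio instead of placeholder noise, one option is to record from the microphone and chunk the samples. A sketch using the third-party sounddevice library (not part of elsai-stt; the chunk size and dtype are assumptions):
import numpy as np
import sounddevice as sd

sample_rate = 16000
duration = 5  # seconds of audio to capture

# Record mono float32 audio and block until the recording finishes
recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
sd.wait()

# Split the recording into fixed-size chunks for input_buffer
chunk_size = 1024
audio_buffers = [recording[i:i + chunk_size].flatten()
                 for i in range(0, len(recording), chunk_size)]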
Method Reference#
AzureOpenAISpeechToSpeech Methods#
transcribe_audio(file_path=None, buffer=None, output_format="webm", sample_rate=24000)
Transcribes audio from file or buffer to text.
- Args:
  - file_path (str, optional): Path to audio file
  - buffer (List[np.ndarray], optional): Audio buffer chunks
  - output_format (str): Format for buffer processing ("webm", "wav", "mp3")
  - sample_rate (int): Sample rate for buffer processing
- Returns:
  - str: Transcribed text
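For instance, transcribing from in-memory chunks rather than a file (reusing an audio_buffers list like the one from the buffer example above):
text = speech_service.transcribe_audio(
    buffer=audio_buffers,   # List[np.ndarray] chunks
    output_format="wav",    # container format used for buffer processing
    sample_rate=16000
)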
synthesize_speech(text, voice="alloy", response_format="mp3", speed=1.0, output_path=None, return_buffer=False)
Converts text to speech.
- Args:
  - text (str): Text to convert
  - voice (str): Voice selection
  - response_format (str): Audio format
  - speed (float): Speech speed (0.25-4.0)
  - output_path (str, optional): Save location
  - return_buffer (bool): Return bytes instead of file path
- Returns:
  - Union[str, bytes, None]: File path, audio bytes, or None
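For example, requesting raw bytes instead of a file, which can then be streamed or written manually:
audio_bytes = speech_service.synthesize_speech(
    text="Streaming-friendly output.",
    voice="nova",
    response_format="mp3",
    return_buffer=True   # return bytes rather than writing a file
)
with open("manual_output.mp3", "wb") as f:
    f.write(audio_bytes)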
process_speech_to_speech(…)
Complete speech-to-speech pipeline with extensive configuration options.
- Key Args:
  - input_file_path / input_buffer: Audio input
  - text_processor (callable, optional): Function to process transcribed text
  - return_transcription (bool): Include transcription in return
  - output_voice, output_format, output_speed: TTS configuration
- Returns:
  - Union[str, bytes, tuple, None]: Based on return flags
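Because the return type depends on the flags passed, callers may want to branch on the shape of the result, for example:
result = speech_service.process_speech_to_speech(
    input_file_path="input.wav",
    output_path="output.mp3",
    return_transcription=True
)
if isinstance(result, tuple):
    # (audio_path, original_transcription, processed_text), as documented above
    audio_path, transcription, processed_text = result
else:
    audio_path = result  # without return_transcription, only the path is returned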