
Elsai Speech Services

Package: elsai-stt  v0.1.0

Provides audio processing capabilities: Speech-to-Text (STT), Text-to-Speech (TTS), and end-to-end Speech-to-Speech conversion, built on Azure OpenAI Whisper and TTS models.

Installation

bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-stt/ elsai-stt==0.1.0

Requirements: Python >= 3.9, openai, pydub, numpy, python-dotenv


Available classes

| Class | Import path | Purpose |
| --- | --- | --- |
| AzureOpenAIWhisper | elsai_stt.stt.azure_openai | Speech-to-Text transcription |
| AzureOpenAITTS | elsai_stt.tts.azure_openai | Text-to-Speech synthesis |
| AzureOpenAISpeechToSpeech | elsai_stt.s2s.azure_openai | End-to-end Speech-to-Speech pipeline |

AzureOpenAIWhisper — Speech-to-Text

Transcribes audio files to text using Azure's hosted OpenAI Whisper model.

python
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper

whisper = AzureOpenAIWhisper(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    deployment_id="whisper",
)

# Transcribe an audio file
transcript = whisper.transcribe_audio(file_path="meeting_recording.mp3")
print(transcript)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| endpoint | Azure OpenAI service endpoint URL |
| api_key | Azure OpenAI API key |
| api_version | API version (e.g. "2024-02-01") |
| deployment_id | Whisper deployment name in your Azure resource |

Methods:

| Method | Description |
| --- | --- |
| transcribe_audio(file_path) | Transcribes the audio file at file_path and returns the transcribed text as a string |
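Since transcribe_audio(file_path) handles one file per call, a small loop is enough to transcribe a whole folder. The sketch below is illustrative only: transcribe_folder and the Transcriber protocol are hypothetical helpers, not part of elsai-stt, and the default suffix list is an assumption rather than a statement of which formats the service accepts.

```python
from pathlib import Path
from typing import Dict, Iterable, Protocol


class Transcriber(Protocol):
    """Any object exposing transcribe_audio(file_path=...) -> str,
    such as an AzureOpenAIWhisper instance."""

    def transcribe_audio(self, file_path: str) -> str: ...


def transcribe_folder(
    client: Transcriber,
    folder: str,
    suffixes: Iterable[str] = (".mp3", ".wav"),  # assumed defaults, adjust as needed
) -> Dict[str, str]:
    """Transcribe every matching file in `folder`.

    Returns a dict mapping file name to transcript, in sorted file-name order.
    """
    wanted = {s.lower() for s in suffixes}
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            results[path.name] = client.transcribe_audio(file_path=str(path))
    return results
```

With a configured client, usage would look like transcribe_folder(whisper, "recordings/"). Errors from individual files are not caught here, so one failing file stops the loop.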

Environment variables: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_DEPLOYMENT_ID
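The page lists these variables but does not say whether the classes read them automatically. A cautious approach is to read them yourself and pass the values explicitly. The helper below is a minimal sketch under that assumption; azure_kwargs_from_env and PARAM_TO_ENV are hypothetical names, not part of the package.

```python
import os
from typing import Dict, Mapping, Optional

# Constructor parameter name -> environment variable listed in this section.
PARAM_TO_ENV = {
    "endpoint": "AZURE_OPENAI_ENDPOINT",
    "api_key": "AZURE_OPENAI_API_KEY",
    "api_version": "AZURE_OPENAI_API_VERSION",
    "deployment_id": "AZURE_OPENAI_DEPLOYMENT_ID",
}


def azure_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build constructor keyword arguments from environment variables.

    Raises RuntimeError naming every variable that is unset or empty,
    so misconfiguration fails before any network call is made.
    """
    env = os.environ if env is None else env
    missing = [var for var in PARAM_TO_ENV.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {param: env[var] for param, var in PARAM_TO_ENV.items()}
```

The result can then be unpacked into the constructor, for example AzureOpenAIWhisper(**azure_kwargs_from_env()). If you keep the values in a .env file, calling load_dotenv() from python-dotenv beforehand populates os.environ.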


AzureOpenAITTS — Text-to-Speech

Converts text to speech using Azure OpenAI TTS models.

Available voices: alloy, echo, fable, onyx, nova, shimmer

Supported audio formats: mp3, opus, aac, flac, wav, pcm

python
from elsai_stt.tts.azure_openai import AzureOpenAITTS

tts = AzureOpenAITTS(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    deployment_id="tts",
)

# Generate speech and save to file
output_path = tts.text_to_speech(
    text="Hello! Welcome to Elsai.",
    voice="alloy",
    format="mp3",
    speed=1.0,
    save_to="output.mp3",
)
print(output_path)

Constructor parameters:

| Parameter | Description |
| --- | --- |
| endpoint | Azure OpenAI service endpoint URL |
| api_key | Azure OpenAI API key |
| api_version | API version (e.g. "2024-02-01") |
| deployment_id | TTS deployment name in your Azure resource |

text_to_speech() parameters:

| Parameter | Description |
| --- | --- |
| text | The text string to synthesize |
| voice | Voice to use: "alloy", "echo", "fable", "onyx", "nova", or "shimmer" |
| format | Output audio format: "mp3", "opus", "aac", "flac", "wav", or "pcm" |
| speed | Playback speed, from 0.25 (slowest) to 4.0 (fastest); 1.0 is normal |
| save_to | File path to save the generated audio |

Returns the path to the saved audio file.
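Because voice, format, and speed each have a fixed set of accepted values, it can help to validate them locally before making a billable request. The following is a sketch only: check_tts_args is a hypothetical helper, and how the library itself reacts to invalid values is not documented here.

```python
# Accepted values as listed in this section.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def check_tts_args(voice: str, audio_format: str, speed: float) -> None:
    """Raise ValueError if any argument falls outside the documented values."""
    if voice not in VOICES:
        raise ValueError(f"Unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if audio_format not in FORMATS:
        raise ValueError(f"Unknown format {audio_format!r}; expected one of {sorted(FORMATS)}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}")
```

Calling check_tts_args("alloy", "mp3", 1.0) just before tts.text_to_speech(...) surfaces typos such as "mp4" or a speed of 5.0 immediately.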


AzureOpenAISpeechToSpeech — End-to-end pipeline

Combines STT and TTS in a single class for voice-in / voice-out workflows. Supports two initialization modes: a shared Azure resource (the Whisper and TTS deployments sit behind the same endpoint and API key) or separate resources (each deployment has its own endpoint, key, and API version).

Shared resource

python
from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech

s2s = AzureOpenAISpeechToSpeech(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    whisper_deployment_id="whisper",
    tts_deployment_id="tts",
)

Separate resources

python
s2s = AzureOpenAISpeechToSpeech(
    whisper_endpoint="https://your-whisper-resource.openai.azure.com/",
    whisper_api_key="your-whisper-api-key",
    whisper_api_version="2024-02-01",
    whisper_deployment_id="whisper",
    tts_endpoint="https://your-tts-resource.openai.azure.com/",
    tts_api_key="your-tts-api-key",
    tts_api_version="2024-02-01",
    tts_deployment_id="tts",
)
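
The two initialization modes differ only in which keyword arguments are passed. One way to keep calling code uniform is to describe each deployment once and derive the keyword arguments from that. This is a sketch: AzureResource and s2s_kwargs are hypothetical helpers, and the rule "use shared mode when endpoint, key, and version match" is an assumption of this example, not documented library behavior.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AzureResource:
    """Connection details for one Azure OpenAI deployment."""
    endpoint: str
    api_key: str
    api_version: str
    deployment_id: str


def s2s_kwargs(whisper: AzureResource, tts: AzureResource) -> Dict[str, str]:
    """Return constructor kwargs in shared or separate form, as appropriate."""
    shared = (whisper.endpoint, whisper.api_key, whisper.api_version) == (
        tts.endpoint, tts.api_key, tts.api_version
    )
    if shared:
        # Same resource: one endpoint/key/version plus two deployment names.
        return {
            "endpoint": whisper.endpoint,
            "api_key": whisper.api_key,
            "api_version": whisper.api_version,
            "whisper_deployment_id": whisper.deployment_id,
            "tts_deployment_id": tts.deployment_id,
        }
    # Different resources: every setting is prefixed per service.
    return {
        "whisper_endpoint": whisper.endpoint,
        "whisper_api_key": whisper.api_key,
        "whisper_api_version": whisper.api_version,
        "whisper_deployment_id": whisper.deployment_id,
        "tts_endpoint": tts.endpoint,
        "tts_api_key": tts.api_key,
        "tts_api_version": tts.api_version,
        "tts_deployment_id": tts.deployment_id,
    }
```

The result would be unpacked as AzureOpenAISpeechToSpeech(**s2s_kwargs(whisper_cfg, tts_cfg)).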

Transcribe audio

python
# From a file path
text = s2s.transcribe_audio(
    file_path="user_question.mp3",
    output_format="mp3",
    sample_rate=16000,
)
print(text)

transcribe_audio() parameters:

| Parameter | Description |
| --- | --- |
| file_path | Path to the input audio file (optional if buffer is provided) |
| buffer | Audio data as bytes buffer (optional if file_path is provided) |
| output_format | Audio format for any intermediate processing |
| sample_rate | Sample rate for the audio |
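Since transcribe_audio() accepts either file_path or buffer, a thin wrapper can route each kind of input to the right keyword. This is a sketch under one notable assumption: that buffer accepts a plain bytes object (the table says "bytes buffer" without specifying the exact type). transcribe_source is a hypothetical helper.

```python
from pathlib import Path
from typing import Any, Union


def transcribe_source(client: Any, source: Union[str, Path, bytes, bytearray], **options: Any) -> str:
    """Send in-memory audio via buffer= and anything else via file_path=.

    `client` is any object with a transcribe_audio() method, such as an
    AzureOpenAISpeechToSpeech instance. Extra keyword arguments
    (e.g. output_format, sample_rate) are forwarded unchanged.
    """
    if isinstance(source, (bytes, bytearray)):
        # Assumption: the library accepts raw bytes for `buffer`.
        return client.transcribe_audio(buffer=bytes(source), **options)
    return client.transcribe_audio(file_path=str(source), **options)
```

This is convenient in web handlers, where uploaded audio usually arrives as bytes and writing a temporary file just to obtain a path is unnecessary.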

Synthesize speech

python
# Save to file
output_path = s2s.synthesize_speech(
    text="Here is your answer.",
    voice="nova",
    response_format="mp3",
    speed=1.0,
    output_path="response.mp3",
)

# Return as bytes buffer
audio_bytes = s2s.synthesize_speech(
    text="Here is your answer.",
    voice="nova",
    return_buffer=True,
)

synthesize_speech() parameters:

| Parameter | Description |
| --- | --- |
| text | Text to convert to speech |
| voice | Voice to use (e.g. "alloy", "nova", "shimmer") |
| response_format | Output audio format (e.g. "mp3", "wav") |
| speed | Playback speed, from 0.25 to 4.0 |
| output_path | File path to save the audio (optional) |
| return_buffer | If True, returns audio as bytes instead of saving to file |
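With return_buffer=True the audio stays in memory, which suits cases where the bytes go straight into an HTTP response or another stream rather than onto disk. The sketch below writes the result into any binary file-like object; synthesize_into is a hypothetical helper, and the type check reflects the table's statement that bytes are returned.

```python
from typing import Any, BinaryIO


def synthesize_into(client: Any, text: str, sink: BinaryIO, voice: str = "nova") -> int:
    """Synthesize `text` and write the audio bytes to `sink`.

    `client` is any object with a synthesize_speech() method, such as an
    AzureOpenAISpeechToSpeech instance. Returns the number of bytes written.
    """
    audio = client.synthesize_speech(text=text, voice=voice, return_buffer=True)
    if not isinstance(audio, (bytes, bytearray)):
        # Guard against a path or other object being returned instead of bytes.
        raise TypeError(f"expected bytes from synthesize_speech, got {type(audio).__name__}")
    return sink.write(audio)
```

The sink can be an io.BytesIO, an open file in "wb" mode, or a framework's response stream.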

Full Speech-to-Speech

Process an audio input, optionally transform the transcript with a callable, and return synthesized audio:

python
# Basic — transcribe input, speak the transcript back
result = s2s.process_speech_to_speech(
    input_file_path="user_question.mp3",
    output_voice="nova",
    output_format="mp3",
    output_path="agent_response.mp3",
)

# With a text processor (e.g. send transcript through an LLM)
# my_agent is a placeholder for your own LLM or agent call (str -> str)
def answer_with_llm(transcript: str) -> str:
    return my_agent(transcript)

result = s2s.process_speech_to_speech(
    input_file_path="user_question.mp3",
    text_processor=answer_with_llm,
    return_transcription=True,    # also return the input transcript
    output_voice="nova",
    output_format="mp3",
    output_path="agent_response.mp3",
)

audio, transcript = result

process_speech_to_speech() parameters:

| Parameter | Description |
| --- | --- |
| input_file_path | Path to the input audio file (optional if input_buffer is provided) |
| input_buffer | Audio data as bytes (optional if input_file_path is provided) |
| text_processor | Optional callable that receives the transcript and returns a response string; use it to inject an LLM or agent |
| return_transcription | If True, returns an (audio, transcript) tuple instead of just the audio |
| output_voice | TTS voice for the response |
| output_format | Output audio format |
| output_speed | Playback speed for the synthesized response |
| output_path | File path to save the response audio |

Return value: the audio file path or bytes; an (audio, transcript) tuple when return_transcription=True.
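A text_processor is just a function from transcript string to response string, so it can be wrapped to handle the awkward cases: silence that transcribes to an empty string, or an agent call that fails. The factory below is a minimal sketch; make_text_processor and its fallback message are hypothetical, and catching every Exception is a deliberate simplification that you may want to narrow.

```python
from typing import Callable


def make_text_processor(
    agent: Callable[[str], str],
    fallback: str = "Sorry, I did not catch that.",
) -> Callable[[str], str]:
    """Wrap `agent` so the pipeline always receives a non-empty reply."""

    def processor(transcript: str) -> str:
        cleaned = transcript.strip()
        if not cleaned:
            # Nothing intelligible was transcribed; skip the agent call.
            return fallback
        try:
            reply = agent(cleaned)
        except Exception:
            # Broad catch keeps the voice loop alive; log or narrow in real code.
            return fallback
        return reply.strip() or fallback

    return processor
```

The wrapped function would be passed as text_processor=make_text_processor(my_agent), where my_agent is your own callable.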


Version history

| Version | Changes |
| --- | --- |
| 0.1.0 | Initial release: AzureOpenAIWhisper, AzureOpenAITTS, AzureOpenAISpeechToSpeech |

Copyright © 2026 Elsai Foundry.