elsai Speech Services

Package: elsai-stt v0.1.0

elsai-stt provides audio processing for Speech-to-Text (STT), Text-to-Speech (TTS), and end-to-end Speech-to-Speech conversion, built on Azure OpenAI Whisper and TTS models.

Installation

bash

pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-stt/ elsai-stt==0.1.0

Requirements: Python >= 3.9, openai, pydub, numpy, python-dotenv

Available classes

Class	Import path	Purpose
`AzureOpenAIWhisper`	`elsai_stt.stt.azure_openai`	Speech-to-Text transcription
`AzureOpenAITTS`	`elsai_stt.tts.azure_openai`	Text-to-Speech synthesis
`AzureOpenAISpeechToSpeech`	`elsai_stt.s2s.azure_openai`	End-to-end Speech-to-Speech pipeline

AzureOpenAIWhisper — Speech-to-Text

Transcribes audio files to text using Azure's hosted OpenAI Whisper model.

python

from elsai_stt.stt.azure_openai import AzureOpenAIWhisper

whisper = AzureOpenAIWhisper(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    deployment_id="whisper",
)

# Transcribe an audio file
transcript = whisper.transcribe_audio(file_path="meeting_recording.mp3")
print(transcript)

Constructor parameters:

Parameter	Description
`endpoint`	Azure OpenAI service endpoint URL
`api_key`	Azure OpenAI API key
`api_version`	API version (e.g. `"2024-02-01"`)
`deployment_id`	Whisper deployment name in your Azure resource

Methods:

Method	Description
`transcribe_audio(file_path)`	Transcribes the audio file at `file_path` and returns the transcribed text as a string

Environment variables: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_DEPLOYMENT_ID
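The page lists these variable names but does not say whether the constructor reads them on its own. The following is a minimal sketch that assumes you pass them explicitly; `PARAM_TO_ENV` and `whisper_kwargs_from_env` are hypothetical names for illustration, not part of elsai-stt.

```python
import os

# Constructor parameter name -> environment variable listed above.
PARAM_TO_ENV = {
    "endpoint": "AZURE_OPENAI_ENDPOINT",
    "api_key": "AZURE_OPENAI_API_KEY",
    "api_version": "AZURE_OPENAI_API_VERSION",
    "deployment_id": "AZURE_OPENAI_DEPLOYMENT_ID",
}


def whisper_kwargs_from_env(env=None):
    """Hypothetical helper: build constructor kwargs from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a readable message instead of passing None to the client.
    missing = [var for var in PARAM_TO_ENV.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {param: env[var] for param, var in PARAM_TO_ENV.items()}
```

With the variables set (for example from a `.env` file loaded through python-dotenv's `load_dotenv()`), the result can be unpacked into the constructor: `AzureOpenAIWhisper(**whisper_kwargs_from_env())`.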

AzureOpenAITTS — Text-to-Speech

Converts text to speech using Azure OpenAI TTS models.

Available voices: alloy, echo, fable, onyx, nova, shimmer

Supported audio formats: mp3, opus, aac, flac, wav, pcm

python

from elsai_stt.tts.azure_openai import AzureOpenAITTS

tts = AzureOpenAITTS(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    deployment_id="tts",
)

# Generate speech and save to file
output_path = tts.text_to_speech(
    text="Hello! Welcome to elsai.",
    voice="alloy",
    format="mp3",
    speed=1.0,
    save_to="output.mp3",
)
print(output_path)

Constructor parameters:

Parameter	Description
`endpoint`	Azure OpenAI service endpoint URL
`api_key`	Azure OpenAI API key
`api_version`	API version (e.g. `"2024-02-01"`)
`deployment_id`	TTS deployment name in your Azure resource

text_to_speech() parameters:

Parameter	Description
`text`	The text string to synthesize
`voice`	Voice to use: `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, or `"shimmer"`
`format`	Output audio format: `"mp3"`, `"opus"`, `"aac"`, `"flac"`, `"wav"`, or `"pcm"`
`speed`	Playback speed — `0.25` (slowest) to `4.0` (fastest); `1.0` is normal
`save_to`	File path to save the generated audio

Returns the path to the saved audio file.
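Because the voices, formats, and speed range above are fixed lists, it can help to check arguments locally before making a network call. The sketch below uses only the values documented on this page; `check_tts_args` is a hypothetical helper, and whether the library performs similar validation itself is not stated here.

```python
# Values copied from the parameter table above.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def check_tts_args(voice, audio_format, speed=1.0):
    """Hypothetical helper: validate arguments and return them as kwargs."""
    if voice not in VOICES:
        raise ValueError(f"Unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if audio_format not in FORMATS:
        raise ValueError(f"Unknown format {audio_format!r}; expected one of {sorted(FORMATS)}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}")
    # Keys match the text_to_speech() parameter names documented above.
    return {"voice": voice, "format": audio_format, "speed": speed}
```

A call could then look like `tts.text_to_speech(text="Hello", save_to="output.mp3", **check_tts_args("alloy", "mp3"))`.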

AzureOpenAISpeechToSpeech — End-to-end pipeline

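Combines STT and TTS in a single class for voice-in / voice-out workflows. Supports two initialization modes: a shared Azure resource (the Whisper and TTS deployments live on the same resource and share one endpoint, API key, and API version) or separate resources, each with its own endpoint, API key, and API version.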

Shared resource

python

from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech

s2s = AzureOpenAISpeechToSpeech(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    whisper_deployment_id="whisper",
    tts_deployment_id="tts",
)

Separate resources

python

s2s = AzureOpenAISpeechToSpeech(
    whisper_endpoint="https://your-whisper-resource.openai.azure.com/",
    whisper_api_key="your-whisper-api-key",
    whisper_api_version="2024-02-01",
    whisper_deployment_id="whisper",
    tts_endpoint="https://your-tts-resource.openai.azure.com/",
    tts_api_key="your-tts-api-key",
    tts_api_version="2024-02-01",
    tts_deployment_id="tts",
)

Transcribe audio

python

# From a file path
text = s2s.transcribe_audio(
    file_path="user_question.mp3",
    output_format="mp3",
    sample_rate=16000,
)
print(text)

transcribe_audio() parameters:

Parameter	Description
`file_path`	Path to the input audio file (optional if `buffer` is provided)
`buffer`	Audio data as bytes buffer (optional if `file_path` is provided)
`output_format`	Audio format for any intermediate processing
`sample_rate`	Sample rate for the audio, in Hz (e.g. `16000`)
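To try the `buffer` parameter without a recording on disk, one option is to build a small WAV file in memory. This sketch uses only the Python standard library; `make_test_tone` is a hypothetical helper, and the page does not state whether `buffer` expects raw `bytes` or a file-like object, so treat the `bytes` return type as an assumption.

```python
import io
import math
import struct
import wave


def make_test_tone(sample_rate=16000, seconds=1.0, freq_hz=440.0):
    """Return a mono 16-bit WAV file, as bytes, containing a sine tone."""
    n_frames = int(sample_rate * seconds)
    # 16-bit little-endian samples at 30% of full scale.
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return out.getvalue()
```

The bytes could then be passed as `s2s.transcribe_audio(buffer=make_test_tone(), output_format="wav", sample_rate=16000)`. A pure tone will not produce a meaningful transcript; the point is only to exercise the buffer code path.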

Synthesize speech

python

# Save to file
output_path = s2s.synthesize_speech(
    text="Here is your answer.",
    voice="nova",
    response_format="mp3",
    speed=1.0,
    output_path="response.mp3",
)

# Return as bytes buffer
audio_bytes = s2s.synthesize_speech(
    text="Here is your answer.",
    voice="nova",
    return_buffer=True,
)

synthesize_speech() parameters:

Parameter	Description
`text`	Text to convert to speech
`voice`	Voice to use (e.g. `"alloy"`, `"nova"`, `"shimmer"`)
`response_format`	Output audio format (e.g. `"mp3"`, `"wav"`)
`speed`	Playback speed — `0.25` to `4.0`
`output_path`	File path to save the audio (optional)
`return_buffer`	If `True`, returns audio as bytes instead of saving to file
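When `return_buffer=True` is used, the caller decides where the audio ends up. The sketch below writes the returned audio to disk; `save_audio` is a hypothetical helper, and it accepts either raw bytes or a `BytesIO`-like object because the exact return type is not specified on this page.

```python
from pathlib import Path


def save_audio(audio, path):
    """Hypothetical helper: write returned audio (bytes or BytesIO-like) to a file."""
    # BytesIO-like objects expose getvalue(); anything else is treated as bytes.
    data = audio.getvalue() if hasattr(audio, "getvalue") else bytes(audio)
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

For example, `save_audio(audio_bytes, "out/response.mp3")` creates the `out` directory if needed and returns the path as a `pathlib.Path`.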

Full Speech-to-Speech

Process an audio input, optionally transform the transcript with a callable, and return synthesized audio:

python

# Basic — transcribe input, speak the transcript back
result = s2s.process_speech_to_speech(
    input_file_path="user_question.mp3",
    output_voice="nova",
    output_format="mp3",
    output_path="agent_response.mp3",
)

# With a text processor (e.g. send transcript through an LLM)
def answer_with_llm(transcript: str) -> str:
    # my_agent is a placeholder for your own LLM or agent call
    return my_agent(transcript)

result = s2s.process_speech_to_speech(
    input_file_path="user_question.mp3",
    text_processor=answer_with_llm,
    return_transcription=True,    # also return the input transcript
    output_voice="nova",
    output_format="mp3",
    output_path="agent_response.mp3",
)

audio, transcript = result

process_speech_to_speech() parameters:

Parameter	Description
`input_file_path`	Path to the input audio file (optional if `input_buffer` is provided)
`input_buffer`	Audio data as bytes (optional if `input_file_path` is provided)
`text_processor`	Optional callable that receives the transcript and returns a response string — use to inject an LLM or agent
`return_transcription`	If `True`, returns an `(audio, transcript)` tuple instead of just the audio
`output_voice`	TTS voice for the response
`output_format`	Output audio format
`output_speed`	Playback speed for the synthesized response
`output_path`	File path to save the response audio

Return value: audio file path or bytes; an (audio, transcript) tuple when return_transcription=True.
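Two details of this method are easy to get wrong: the text processor may receive an empty transcript or raise, and the return shape changes with `return_transcription`. The sketch below shows one way to handle both. `with_fallback` and `split_result` are hypothetical helpers, and the fallback sentence is an arbitrary choice.

```python
def with_fallback(processor, fallback="Sorry, I did not catch that."):
    """Wrap a text processor so empty input, empty output, or errors yield a fallback reply."""
    def wrapped(transcript: str) -> str:
        if not transcript or not transcript.strip():
            return fallback
        try:
            reply = processor(transcript)
        except Exception:
            # Keep the voice pipeline alive even if the LLM or agent call fails.
            return fallback
        return reply if reply and reply.strip() else fallback
    return wrapped


def split_result(result, return_transcription):
    """Normalise the return value into an (audio, transcript) pair."""
    if return_transcription:
        audio, transcript = result
        return audio, transcript
    return result, None
```

The wrapped callable can be passed as `text_processor=with_fallback(answer_with_llm)`, and `split_result(result, return_transcription=True)` gives downstream code the same two-value shape in both modes.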

Version history

Version	Changes
0.1.0	Initial release — `AzureOpenAIWhisper`, `AzureOpenAITTS`, `AzureOpenAISpeechToSpeech`
