Elsai Speech Services
Package: elsai-stt v0.1.0
Audio processing capabilities including Speech-to-Text (STT), Text-to-Speech (TTS), and end-to-end Speech-to-Speech conversion using Azure OpenAI Whisper and TTS models.
Installation
```bash
pip install --extra-index-url https://core-packages.elsai.ai/root/elsai-stt/ elsai-stt==0.1.0
```
Requirements: Python >= 3.9, openai, pydub, numpy, python-dotenv
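As a quick sanity check after installation, the listed requirements can be verified from Python. This is a minimal sketch using only the standard library; the helper names `missing_modules` and `python_is_supported` are illustrative and not part of elsai-stt. Note that python-dotenv is imported under the module name `dotenv`.

```python
import importlib.util
import sys

# Import names for the listed requirements; python-dotenv is imported as "dotenv".
REQUIRED_MODULES = ["openai", "pydub", "numpy", "dotenv"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 9)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


print("Python version supported:", python_is_supported())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list means every listed dependency is importable in the current environment.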
Available classes
| Class | Import path | Purpose |
|---|---|---|
| AzureOpenAIWhisper | elsai_stt.stt.azure_openai | Speech-to-Text transcription |
| AzureOpenAITTS | elsai_stt.tts.azure_openai | Text-to-Speech synthesis |
| AzureOpenAISpeechToSpeech | elsai_stt.s2s.azure_openai | End-to-end Speech-to-Speech pipeline |
AzureOpenAIWhisper — Speech-to-Text
Transcribes audio files to text using Azure's hosted OpenAI Whisper model.
```python
from elsai_stt.stt.azure_openai import AzureOpenAIWhisper

whisper = AzureOpenAIWhisper(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    deployment_id="whisper",
)

# Transcribe an audio file
transcript = whisper.transcribe_audio(file_path="meeting_recording.mp3")
print(transcript)
```
Constructor parameters:
| Parameter | Description |
|---|---|
| endpoint | Azure OpenAI service endpoint URL |
| api_key | Azure OpenAI API key |
| api_version | API version (e.g. "2024-02-01") |
| deployment_id | Whisper deployment name in your Azure resource |
Methods:
| Method | Description |
|---|---|
| transcribe_audio(file_path) | Transcribes the audio file at file_path and returns the transcribed text as a string |
Environment variables: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_DEPLOYMENT_ID
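The list above names the variables but does not say whether the class reads them on its own, so the sketch below reads them explicitly and passes them to the constructor. The mapping from variable name to constructor argument is an assumption based on the matching names, and `whisper_kwargs_from_env` is a hypothetical helper, not part of elsai-stt. If the values live in a `.env` file, calling `load_dotenv()` from python-dotenv first will place them in `os.environ`.

```python
import os

# Assumed mapping from each documented variable to the constructor argument it fills.
ENV_TO_KWARG = {
    "AZURE_OPENAI_ENDPOINT": "endpoint",
    "AZURE_OPENAI_API_KEY": "api_key",
    "AZURE_OPENAI_API_VERSION": "api_version",
    "AZURE_OPENAI_DEPLOYMENT_ID": "deployment_id",
}


def whisper_kwargs_from_env(env=None):
    """Build constructor keyword arguments from environment variables.

    Raises KeyError naming every variable that is unset or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}


# Hypothetical usage (requires elsai-stt and real credentials):
# from elsai_stt.stt.azure_openai import AzureOpenAIWhisper
# whisper = AzureOpenAIWhisper(**whisper_kwargs_from_env())
```

Failing early with the full list of missing variables tends to be easier to debug than an authentication error raised later by the service.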
AzureOpenAITTS — Text-to-Speech
Converts text to speech using Azure OpenAI TTS models.
Available voices: alloy, echo, fable, onyx, nova, shimmer
Supported audio formats: mp3, opus, aac, flac, wav, pcm
```python
from elsai_stt.tts.azure_openai import AzureOpenAITTS

tts = AzureOpenAITTS(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    deployment_id="tts",
)

# Generate speech and save to file
output_path = tts.text_to_speech(
    text="Hello! Welcome to Elsai.",
    voice="alloy",
    format="mp3",
    speed=1.0,
    save_to="output.mp3",
)
print(output_path)
```
Constructor parameters:
| Parameter | Description |
|---|---|
| endpoint | Azure OpenAI service endpoint URL |
| api_key | Azure OpenAI API key |
| api_version | API version (e.g. "2024-02-01") |
| deployment_id | TTS deployment name in your Azure resource |
text_to_speech() parameters:
| Parameter | Description |
|---|---|
| text | The text string to synthesize |
| voice | Voice to use: "alloy", "echo", "fable", "onyx", "nova", or "shimmer" |
| format | Output audio format: "mp3", "opus", "aac", "flac", "wav", or "pcm" |
| speed | Playback speed: 0.25 (slowest) to 4.0 (fastest); 1.0 is normal |
| save_to | File path to save the generated audio |
Returns the path to the saved audio file.
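The voice, format, and speed constraints listed above can be checked locally before a request is sent. The following is a minimal sketch; `validate_tts_args` is a hypothetical helper, and whether the library performs similar validation itself is not stated here.

```python
# Allowed values as documented for text_to_speech().
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def validate_tts_args(voice, audio_format, speed=1.0):
    """Return keyword arguments for text_to_speech(), or raise ValueError."""
    if voice not in VOICES:
        raise ValueError(f"Unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if audio_format not in FORMATS:
        raise ValueError(f"Unknown format {audio_format!r}; expected one of {sorted(FORMATS)}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}")
    return {"voice": voice, "format": audio_format, "speed": float(speed)}


# Hypothetical usage with the tts object created above:
# tts.text_to_speech(text="Hello!", save_to="hello.mp3", **validate_tts_args("nova", "wav", 1.25))
```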
AzureOpenAISpeechToSpeech — End-to-end pipeline
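Combines STT and TTS in a single class for voice-in / voice-out workflows. Supports two initialization modes: a shared Azure resource (the Whisper and TTS deployments live on the same resource and share one endpoint and key) or separate resources, each with its own endpoint, key, and API version.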
Shared resource
```python
from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech

s2s = AzureOpenAISpeechToSpeech(
    endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-01",
    whisper_deployment_id="whisper",
    tts_deployment_id="tts",
)
```
Separate resources
```python
from elsai_stt.s2s.azure_openai import AzureOpenAISpeechToSpeech

s2s = AzureOpenAISpeechToSpeech(
    whisper_endpoint="https://your-whisper-resource.openai.azure.com/",
    whisper_api_key="your-whisper-api-key",
    whisper_api_version="2024-02-01",
    whisper_deployment_id="whisper",
    tts_endpoint="https://your-tts-resource.openai.azure.com/",
    tts_api_key="your-tts-api-key",
    tts_api_version="2024-02-01",
    tts_deployment_id="tts",
)
```
Transcribe audio
```python
# From a file path
text = s2s.transcribe_audio(
    file_path="user_question.mp3",
    output_format="mp3",
    sample_rate=16000,
)
print(text)
```
transcribe_audio() parameters:
| Parameter | Description |
|---|---|
| file_path | Path to the input audio file (optional if buffer is provided) |
| buffer | Audio data as bytes buffer (optional if file_path is provided) |
| output_format | Audio format for any intermediate processing |
| sample_rate | Sample rate for the audio |
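Because file_path and buffer are each optional only when the other is supplied, it can help to enforce that rule before calling transcribe_audio(). The sketch below is one way to do so; `build_transcribe_kwargs` is a hypothetical helper, and rejecting a call that supplies both inputs is an assumption, since the table does not say which one would take precedence.

```python
from pathlib import Path


def build_transcribe_kwargs(file_path=None, buffer=None, **options):
    """Return keyword arguments for transcribe_audio() with exactly one audio source."""
    # Equal truth values mean either both sources were given or neither was.
    if (file_path is None) == (buffer is None):
        raise ValueError("Provide exactly one of file_path or buffer")
    source = {"file_path": file_path} if file_path is not None else {"buffer": buffer}
    return {**source, **options}


def read_audio_bytes(path):
    """Load an audio file into memory so it can be passed as buffer."""
    return Path(path).read_bytes()


# Hypothetical usage with the s2s object created above:
# kwargs = build_transcribe_kwargs(buffer=read_audio_bytes("user_question.mp3"), sample_rate=16000)
# text = s2s.transcribe_audio(**kwargs)
```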
Synthesize speech
```python
# Save to file
output_path = s2s.synthesize_speech(
    text="Here is your answer.",
    voice="nova",
    response_format="mp3",
    speed=1.0,
    output_path="response.mp3",
)

# Return as bytes buffer
audio_bytes = s2s.synthesize_speech(
    text="Here is your answer.",
    voice="nova",
    return_buffer=True,
)
```
synthesize_speech() parameters:
| Parameter | Description |
|---|---|
| text | Text to convert to speech |
| voice | Voice to use (e.g. "alloy", "nova", "shimmer") |
| response_format | Output audio format (e.g. "mp3", "wav") |
| speed | Playback speed: 0.25 to 4.0 |
| output_path | File path to save the audio (optional) |
| return_buffer | If True, returns audio as bytes instead of saving to file |
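Since synthesize_speech() can hand back either a saved file path or raw bytes, downstream code may want a single way to end up with a file on disk. A minimal sketch, assuming the result is a `str` path in the first case and a `bytes` object in the second; `persist_audio` is a hypothetical helper.

```python
from pathlib import Path


def persist_audio(result, fallback_path):
    """Return a Path to the audio, writing it to fallback_path if result is raw bytes."""
    if isinstance(result, (bytes, bytearray)):
        target = Path(fallback_path)
        target.write_bytes(result)
        return target
    # Otherwise assume the audio was already saved and result is its path.
    return Path(result)


# Hypothetical usage with the s2s object created above:
# audio = s2s.synthesize_speech(text="Here is your answer.", voice="nova", return_buffer=True)
# saved = persist_audio(audio, "response.mp3")
```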
Full Speech-to-Speech
Process an audio input, optionally transform the transcript with a callable, and return synthesized audio:
```python
# Basic: transcribe the input and speak the transcript back
result = s2s.process_speech_to_speech(
    input_file_path="user_question.mp3",
    output_voice="nova",
    output_format="mp3",
    output_path="agent_response.mp3",
)

# With a text processor (e.g. send the transcript through an LLM)
def answer_with_llm(transcript: str) -> str:
    # my_agent is a placeholder for your own LLM or agent call
    return my_agent(transcript)

result = s2s.process_speech_to_speech(
    input_file_path="user_question.mp3",
    text_processor=answer_with_llm,
    return_transcription=True,  # also return the input transcript
    output_voice="nova",
    output_format="mp3",
    output_path="agent_response.mp3",
)
audio, transcript = result
```
process_speech_to_speech() parameters:
| Parameter | Description |
|---|---|
| input_file_path | Path to the input audio file (optional if input_buffer is provided) |
| input_buffer | Audio data as bytes (optional if input_file_path is provided) |
| text_processor | Optional callable that receives the transcript and returns a response string; use it to inject an LLM or agent |
| return_transcription | If True, returns an (audio, transcript) tuple instead of just the audio |
| output_voice | TTS voice for the response |
| output_format | Output audio format |
| output_speed | Playback speed for the synthesized response |
| output_path | File path to save the response audio |
Return value: the audio as a file path or bytes; an (audio, transcript) tuple when return_transcription=True.
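The control flow described in this section (transcribe, optionally transform the transcript, synthesize, optionally return the transcript too) can be illustrated offline with stand-in functions. This is a sketch of the documented behavior, not the library's implementation; `speech_to_speech`, `fake_transcribe`, and `fake_synthesize` are hypothetical names, and the fakes simply treat the audio bytes as UTF-8 text.

```python
from typing import Callable, Optional, Tuple, Union


def speech_to_speech(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    synthesize: Callable[[str], bytes],
    text_processor: Optional[Callable[[str], str]] = None,
    return_transcription: bool = False,
) -> Union[bytes, Tuple[bytes, str]]:
    """Transcribe, optionally transform the transcript, then synthesize the reply."""
    transcript = transcribe(audio_in)
    # Without a processor, the transcript itself is spoken back.
    reply = text_processor(transcript) if text_processor else transcript
    audio_out = synthesize(reply)
    # The returned transcript is the input transcript, not the processed reply.
    return (audio_out, transcript) if return_transcription else audio_out


# Stand-ins for the STT and TTS calls so the flow can run without credentials.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


audio, transcript = speech_to_speech(
    b"what time is it",
    fake_transcribe,
    fake_synthesize,
    text_processor=str.upper,  # stands in for an LLM or agent call
    return_transcription=True,
)
```

Keeping the text processor as a plain callable means any function that maps a string to a string can be swapped in without touching the audio steps.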
Version history
| Version | Changes |
|---|---|
| 0.1.0 | Initial release — AzureOpenAIWhisper, AzureOpenAITTS, AzureOpenAISpeechToSpeech |