Azure Speech to Text REST API for Short Audio
Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.
Prerequisites
- Azure subscription - Create one free
- Speech resource - Create in Azure Portal
- Get credentials - After deployment, go to resource > Keys and Endpoint
Environment Variables
Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope
Alternative: Use endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
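The two configurations above can be reconciled in a single helper that resolves the recognition URL. This is a minimal sketch: the helper name `resolve_recognition_url` is hypothetical, and letting `AZURE_SPEECH_ENDPOINT` take precedence over `AZURE_SPEECH_REGION` is an assumption of this example, not behavior defined by the service.

```python
import os

# Path used by the short-audio recognition examples in this document.
RECOGNITION_PATH = "/speech/recognition/conversation/cognitiveservices/v1"


def resolve_recognition_url(env=None) -> str:
    """Build the recognition URL from environment-style settings."""
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_SPEECH_ENDPOINT")
    if endpoint:
        # Assumption: an explicit endpoint wins over the region.
        base = endpoint.rstrip("/")
    else:
        region = env.get("AZURE_SPEECH_REGION")
        if not region:
            raise KeyError("Set AZURE_SPEECH_REGION or AZURE_SPEECH_ENDPOINT")
        base = f"https://{region}.stt.speech.microsoft.com"
    return base + RECOGNITION_PATH
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.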
Installation
pip install requests
Quick Start
import os

import requests


def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
    """Transcribe short audio file (max 60 seconds) using REST API."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {
        "language": language,
        "format": "detailed"  # or "simple"
    }

    # Send the file as the raw request body.
    with open(audio_file_path, "rb") as audio_file:
        response = requests.post(url, headers=headers, params=params, data=audio_file)

    response.raise_for_status()
    return response.json()
Usage
result = transcribe_audio("audio.wav", "en-US")
print(result["DisplayText"])
Audio Requirements
| Format | Codec | Sample Rate | Notes |
|--------|-------|-------------|-------|
| WAV | PCM | 16 kHz, mono | Recommended |
| OGG | OPUS | 16 kHz, mono | Smaller file size |
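Each format in the table is sent with its own `Content-Type` value. The sketch below picks the header from the file extension; the helper name `content_type_for` is hypothetical, and it assumes the extension matches the actual encoding of the file, which it does not verify.

```python
from pathlib import Path

# Header values as used elsewhere in this document.
CONTENT_TYPES = {
    ".wav": "audio/wav; codecs=audio/pcm; samplerate=16000",
    ".ogg": "audio/ogg; codecs=opus",
}


def content_type_for(path: str) -> str:
    """Return the Content-Type header value for a supported audio file."""
    suffix = Path(path).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio extension: {suffix!r}") from None
```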
Limitations:
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)
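Checking a file locally before uploading can catch format and length problems early. The following is a rough sketch using the standard-library `wave` module, which reads only uncompressed PCM WAV files and raises `wave.Error` for anything else; the function name `check_wav` is hypothetical, and the checks mirror the recommended WAV row and the 60-second limit above.

```python
import wave

MAX_SECONDS = 60.0  # limit stated above; use 30.0 for pronunciation assessment


def check_wav(path: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Return a list of problems; an empty list means the file looks suitable."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        if channels != 1:
            problems.append(f"expected mono, got {channels} channels")
        if rate != 16000:
            problems.append(f"expected 16000 Hz, got {rate} Hz")
        # Duration is the frame count divided by frames per second.
        duration = wav.getnframes() / rate
        if duration > max_seconds:
            problems.append(f"audio is {duration:.1f} s, limit is {max_seconds:.0f} s")
    return problems
```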
Content-Type Headers
WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"
OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"
Response Formats
Simple Format (default)
params = {"language": "en-US", "format": "simple"}
{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": "1236645672289",
  "Duration": "1236645672289"
}
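`Offset` and `Duration` are expressed in 100-nanosecond units, so they usually need converting before display. A small sketch, with `ticks_to_seconds` as a hypothetical helper name; it accepts either a string, as in the sample above, or an integer.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def ticks_to_seconds(ticks) -> float:
    """Convert a 100-nanosecond tick count (str or int) to seconds."""
    return int(ticks) / TICKS_PER_SECOND
```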
Detailed Format
params = {"language": "en-US", "format": "detailed"}
{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
    {
      "Confidence": 0.9052885,
      "Display": "What's the weather like?",
      "ITN": "what's the weather like",
      "Lexical": "what's the weather like",
      "MaskedITN": "what's the weather like"
    }
  ]
}
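When the detailed format returns several candidates, one simple way to choose is to take the entry with the highest `Confidence`. The sketch below assumes the response shape shown above; `best_hypothesis` is a hypothetical helper, and treating a missing `Confidence` as 0.0 is a choice made for this example.

```python
def best_hypothesis(result: dict):
    """Return the NBest entry with the highest confidence, or None if absent."""
    candidates = result.get("NBest") or []
    if not candidates:
        return None
    # Assumption: entries without a Confidence field rank lowest.
    return max(candidates, key=lambda item: item.get("Confidence", 0.0))
```

For a successful detailed response, `best_hypothesis(result)["Display"]` gives the formatted text of the top candidate.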
Chunked Transfer (Recommended)
For lower latency, stream audio in chunks:
import os

import requests


def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
    """Stream audio in chunks for lower latency."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue"
    }

    params = {"language": language, "format": "detailed"}

    def generate_chunks(file_path: str, chunk_size: int = 1024):
        # Yield the file piece by piece; requests streams a generator body.
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    response = requests.post(
        url,
        headers=headers,
        params=params,
        data=generate_chunks(audio_file_path)
    )

    response.raise_for_status()
    return response.json()
Authentication Options
Option 1: Subscription Key (Simple)
headers = { "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"] }
Option 2: Bearer Token
import os

import requests


def get_access_token() -> str:
    """Get access token from the token endpoint."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

    response = requests.post(
        token_url,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/x-www-form-urlencoded",
            "Content-Length": "0"
        }
    )
    response.raise_for_status()
    return response.text
Use token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json"
}
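Because a token stays valid for 10 minutes, fetching a new one for every request is unnecessary. Below is a minimal caching sketch; `TokenCache` is a hypothetical helper, the 9-minute refresh margin is a conservative choice rather than a service requirement, and in practice `fetch` would be `get_access_token`. The cache is not thread-safe.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Reuse a token until a time-to-live elapses, then fetch a new one."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 9 * 60,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            # Token missing or stale: fetch a fresh one and restart the timer.
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

With `cache = TokenCache(get_access_token)`, each request can build its header from `cache.get()` and only occasionally trigger a call to the token endpoint.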
Query Parameters
| Parameter | Required | Values | Description |
|-----------|----------|--------|-------------|
| language | Yes | en-US, de-DE, etc. | Language of speech |
| format | No | simple, detailed | Result format (default: simple) |
| profanity | No | masked, removed, raw | Profanity handling (default: masked) |
Recognition Status Values
| Status | Description |
|--------|-------------|
| Success | Recognition succeeded |
| NoMatch | Speech detected but no words matched |
| InitialSilenceTimeout | Only silence detected |
| BabbleTimeout | Only noise detected |
| Error | Internal service error |
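An HTTP 200 response can still carry a non-`Success` status, so callers need to inspect `RecognitionStatus` explicitly. The sketch below turns the statuses listed above into a success flag plus a message; `summarize_status` is a hypothetical helper, and how to react to each status (retry, re-record, report) is left to the application.

```python
STATUS_MESSAGES = {
    "Success": "Recognition succeeded",
    "NoMatch": "Speech detected but no words matched",
    "InitialSilenceTimeout": "Only silence detected",
    "BabbleTimeout": "Only noise detected",
    "Error": "Internal service error",
}


def summarize_status(result: dict) -> tuple[bool, str]:
    """Return (succeeded, human-readable message) for a recognition response."""
    status = result.get("RecognitionStatus", "")
    # Fall back to a generic message for statuses not listed above.
    message = STATUS_MESSAGES.get(status, f"Unrecognized status: {status!r}")
    return status == "Success", message
```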
Profanity Handling
Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}
Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}
Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}
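A typo in `format` or `profanity` is easier to catch locally than from a 400 response. This is a small validation sketch; `build_params` is a hypothetical helper, and the allowed values are the ones listed under Query Parameters in this document.

```python
ALLOWED_FORMATS = {"simple", "detailed"}
ALLOWED_PROFANITY = {"masked", "removed", "raw"}


def build_params(language: str, result_format: str = "simple",
                 profanity: str = "masked") -> dict:
    """Validate query parameters locally and return them as a dict."""
    if not language:
        raise ValueError("language is required, e.g. 'en-US'")
    if result_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if profanity not in ALLOWED_PROFANITY:
        raise ValueError(f"profanity must be one of {sorted(ALLOWED_PROFANITY)}")
    return {"language": language, "format": result_format, "profanity": profanity}
```

The returned dictionary can be passed directly as `params=` to `requests.post`.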
Error Handling
import os

import requests


def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
    """Transcribe with proper error handling."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    try:
        with open(audio_path, "rb") as audio_file:
            response = requests.post(
                url,
                headers={
                    "Ocp-Apim-Subscription-Key": api_key,
                    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                    "Accept": "application/json"
                },
                params={"language": language, "format": "detailed"},
                data=audio_file
            )

        if response.status_code == 200:
            result = response.json()
            # A 200 response can still report a failed recognition.
            if result.get("RecognitionStatus") == "Success":
                return result
            else:
                print(f"Recognition failed: {result.get('RecognitionStatus')}")
                return None
        elif response.status_code == 400:
            print("Bad request: Check language code or audio format")
        elif response.status_code == 401:
            print("Unauthorized: Check API key or token")
        elif response.status_code == 403:
            print("Forbidden: Missing authorization header")
        else:
            print(f"Error {response.status_code}: {response.text}")

        return None

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
Async Version
import asyncio
import os

import aiohttp


async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
    """Async version using aiohttp."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {"language": language, "format": "detailed"}

    async with aiohttp.ClientSession() as session:
        # Short audio is small enough to read into memory in one go.
        with open(audio_file_path, "rb") as f:
            audio_data = f.read()

        async with session.post(url, headers=headers, params=params, data=audio_data) as response:
            response.raise_for_status()
            return await response.json()
Usage
result = asyncio.run(transcribe_async("audio.wav", "en-US"))
print(result["DisplayText"])
Supported Languages
Common language codes (see full list):
| Code | Language |
|------|----------|
| en-US | English (US) |
| en-GB | English (UK) |
| de-DE | German |
| fr-FR | French |
| es-ES | Spanish (Spain) |
| es-MX | Spanish (Mexico) |
| zh-CN | Chinese (Mandarin) |
| ja-JP | Japanese |
| ko-KR | Korean |
| pt-BR | Portuguese (Brazil) |
Best Practices
- Use WAV PCM 16kHz mono for best compatibility
- Enable chunked transfer for lower latency
- Cache access tokens for 9 minutes (valid for 10)
- Specify the correct language for accurate recognition
- Use detailed format when you need confidence scores
- Handle all RecognitionStatus values in production code
When NOT to Use This API
Use the Speech SDK or Batch Transcription API instead when you need:
- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files
Reference Files
| File | Contents |
|------|----------|
| references/pronunciation-assessment.md | Pronunciation assessment parameters and scoring |