Azure Speech to Text REST API for Short Audio

Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.

Prerequisites

Azure subscription - Create one free
Speech resource - Create in Azure Portal
Get credentials - After deployment, go to resource > Keys and Endpoint

Environment Variables

# Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope

# Alternative: use the endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
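
The two configuration styles above can be reconciled in one small helper. The following is a minimal sketch, not part of the official API: the name `build_recognition_url` is hypothetical, and it assumes that `AZURE_SPEECH_ENDPOINT`, when set, has the `https://<region>.stt.speech.microsoft.com` form shown above and should take precedence over the region.

```python
import os
from typing import Mapping

# Path of the short-audio recognition endpoint used throughout this document.
RECOGNITION_PATH = "/speech/recognition/conversation/cognitiveservices/v1"


def build_recognition_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the recognition URL from an explicit endpoint or a region.

    Assumption: an explicit AZURE_SPEECH_ENDPOINT wins over AZURE_SPEECH_REGION.
    """
    endpoint = env.get("AZURE_SPEECH_ENDPOINT")
    if not endpoint:
        # Fall back to the regional host name; raises KeyError if neither is set.
        endpoint = f"https://{env['AZURE_SPEECH_REGION']}.stt.speech.microsoft.com"
    return endpoint.rstrip("/") + RECOGNITION_PATH
```

Passing the environment as a mapping keeps the helper easy to test without touching real environment variables.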

Installation

pip install requests

Quick Start

import os
import requests


def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
    """Transcribe short audio file (max 60 seconds) using REST API."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {
        "language": language,
        "format": "detailed"  # or "simple"
    }

    # Send the file object as the request body
    with open(audio_file_path, "rb") as audio_file:
        response = requests.post(url, headers=headers, params=params, data=audio_file)

    response.raise_for_status()
    return response.json()


# Usage
result = transcribe_audio("audio.wav", "en-US")
# With format=detailed, the display text is in the NBest list
# (with format=simple, read result["DisplayText"] instead)
print(result["NBest"][0]["Display"])

Audio Requirements

Format   Codec   Sample Rate    Notes
WAV      PCM     16 kHz, mono   Recommended
OGG      OPUS    16 kHz, mono   Smaller file size

Limitations:

Maximum 60 seconds of audio
For pronunciation assessment: maximum 30 seconds
No partial/interim results (final only)
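
The WAV requirements and the 60-second limit can be checked locally before uploading. The following is a rough sketch using the standard-library `wave` module; the function name `check_wav` and the wording of its messages are illustrative assumptions, and it only covers PCM WAV files (OGG/OPUS would need a different tool).

```python
import wave

MAX_SECONDS = 60       # limit stated above for short audio
EXPECTED_RATE = 16000  # 16 kHz, as listed in the requirements


def check_wav(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file looks usable."""
    problems = []
    # wave.open raises wave.Error if the file is not a PCM WAV file.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != 1:
        problems.append(f"audio has {channels} channels, expected mono")
    if duration > MAX_SECONDS:
        problems.append(f"audio is {duration:.1f} s long, limit is {MAX_SECONDS} s")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report everything that needs fixing in one pass.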

Content-Type Headers

# WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"

# OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"

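One way to avoid hard-coding the header is to pick it from the file extension. This is only a sketch: `content_type_for` is a hypothetical helper, and it assumes that `.wav` files are 16 kHz PCM and `.ogg` files are OPUS, which the extension alone cannot guarantee.

```python
from pathlib import Path

# Header values copied from the two cases listed above.
CONTENT_TYPES = {
    ".wav": "audio/wav; codecs=audio/pcm; samplerate=16000",
    ".ogg": "audio/ogg; codecs=opus",
}


def content_type_for(path: str) -> str:
    """Map an audio file name to the Content-Type header value to send."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"unsupported audio extension: {suffix!r}")
    return CONTENT_TYPES[suffix]
```
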
Response Formats

Simple Format (default)

params = {"language": "en-US", "format": "simple"}

{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": "1236645672289",
  "Duration": "1236645672289"
}

Detailed Format

params = {"language": "en-US", "format": "detailed"}

{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
    {
      "Confidence": 0.9052885,
      "Display": "What's the weather like?",
      "ITN": "what's the weather like",
      "Lexical": "what's the weather like",
      "MaskedITN": "what's the weather like"
    }
  ]
}

Chunked Transfer (Recommended)

For lower latency, stream audio in chunks:

import os
import requests


def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
    """Stream audio in chunks for lower latency."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue"
    }

    params = {"language": language, "format": "detailed"}

    # Yield the file in small pieces; requests sends a generator body chunked
    def generate_chunks(file_path: str, chunk_size: int = 1024):
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    response = requests.post(
        url,
        headers=headers,
        params=params,
        data=generate_chunks(audio_file_path)
    )

    response.raise_for_status()
    return response.json()
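
The same chunking loop works for any binary stream, not only a file on disk, which can be handy when the audio is already in memory. A small sketch, assuming a hypothetical helper named `iter_chunks`; it shows only the chunking step, not the HTTP request.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted."""
    while chunk := stream.read(chunk_size):
        yield chunk


# Example: 2500 bytes of in-memory data split into 1024-byte chunks.
sizes = [len(c) for c in iter_chunks(io.BytesIO(b"\x00" * 2500))]
```

A generator like this can be passed as the `data` argument of `requests.post` in the same way as `generate_chunks` above.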

Authentication Options

Option 1: Subscription Key (Simple)

headers = {
    "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
}

Option 2: Bearer Token

import os
import requests


def get_access_token() -> str:
    """Get access token from the token endpoint."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

    response = requests.post(
        token_url,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/x-www-form-urlencoded",
            "Content-Length": "0"
        }
    )
    response.raise_for_status()
    return response.text  # the response body is the token itself


# Use token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json"
}
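
Since a token is valid for 10 minutes, fetching a new one for every request is wasteful. The class below is a minimal caching sketch, not an official client: `TokenCache` is a hypothetical name, the 9-minute lifetime is a safety margin chosen for illustration, and the fetch function and clock are injected so the logic can be exercised without network access.

```python
import time
from typing import Callable


class TokenCache:
    """Reuse a token until shortly before it expires, then fetch a new one."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 9 * 60,  # assumption: refresh 1 minute early
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: str | None = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            # No token yet, or the cached one is too old: fetch and restart the timer.
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

In practice the cache would be created once, for example `TokenCache(get_access_token)`, and `get()` called whenever an `Authorization` header is built.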

Query Parameters

Parameter   Required   Values                 Description
language    Yes        en-US, de-DE, etc.     Language of speech
format      No         simple, detailed       Result format (default: simple)
profanity   No         masked, removed, raw   Profanity handling (default: masked)
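
A typo in a parameter value is easier to catch locally than from a 400 response. The following sketch validates only the three parameters listed above; `build_params` and `ALLOWED_VALUES` are hypothetical names, and language codes are not checked against the supported list.

```python
# Allowed values for the optional parameters, taken from the table above.
ALLOWED_VALUES = {
    "format": {"simple", "detailed"},
    "profanity": {"masked", "removed", "raw"},
}


def build_params(language: str, **options: str) -> dict[str, str]:
    """Build the query-string dict, rejecting unknown names and values."""
    if not language:
        raise ValueError("language is required, for example 'en-US'")
    params = {"language": language}
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is None:
            raise ValueError(f"unknown parameter: {name}")
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
        params[name] = value
    return params
```

Omitted options are left out of the dict so that the service defaults (`simple`, `masked`) apply.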

Recognition Status Values

Status                  Description
Success                 Recognition succeeded
NoMatch                 Speech detected but no words matched
InitialSilenceTimeout   Only silence detected
BabbleTimeout           Only noise detected
Error                   Internal service error
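
An HTTP 200 response can still carry a non-success status, so callers need to inspect `RecognitionStatus` explicitly. One possible approach, sketched below, turns every status other than `Success` into an exception; the names `RecognitionFailed` and `ensure_success` are illustrative, and the descriptions are copied from the table above.

```python
class RecognitionFailed(Exception):
    """Raised when RecognitionStatus is anything other than Success."""


STATUS_DESCRIPTIONS = {
    "NoMatch": "speech detected but no words matched",
    "InitialSilenceTimeout": "only silence detected",
    "BabbleTimeout": "only noise detected",
    "Error": "internal service error",
}


def ensure_success(result: dict) -> dict:
    """Return the result unchanged on Success, otherwise raise RecognitionFailed."""
    status = result.get("RecognitionStatus")
    if status == "Success":
        return result
    reason = STATUS_DESCRIPTIONS.get(status, "unrecognized status")
    raise RecognitionFailed(f"{status}: {reason}")
```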

Profanity Handling

# Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}

# Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}

# Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}

Error Handling

import os
import requests


def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
    """Transcribe with proper error handling."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    try:
        with open(audio_path, "rb") as audio_file:
            response = requests.post(
                url,
                headers={
                    "Ocp-Apim-Subscription-Key": api_key,
                    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                    "Accept": "application/json"
                },
                params={"language": language, "format": "detailed"},
                data=audio_file
            )

        if response.status_code == 200:
            # HTTP 200 does not guarantee recognition succeeded; check the status field
            result = response.json()
            if result.get("RecognitionStatus") == "Success":
                return result
            else:
                print(f"Recognition failed: {result.get('RecognitionStatus')}")
                return None
        elif response.status_code == 400:
            print("Bad request: Check language code or audio format")
        elif response.status_code == 401:
            print("Unauthorized: Check API key or token")
        elif response.status_code == 403:
            print("Forbidden: Missing authorization header")
        else:
            print(f"Error {response.status_code}: {response.text}")

        return None

    except requests.exceptions.RequestException as e:
        # Network-level failures: DNS errors, timeouts, dropped connections
        print(f"Request failed: {e}")
        return None

Async Version

import os
import aiohttp
import asyncio


async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
    """Async version using aiohttp."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {"language": language, "format": "detailed"}

    async with aiohttp.ClientSession() as session:
        # Short audio files are small, so reading the whole file at once is fine
        with open(audio_file_path, "rb") as f:
            audio_data = f.read()

        async with session.post(url, headers=headers, params=params, data=audio_data) as response:
            response.raise_for_status()
            return await response.json()


# Usage
result = asyncio.run(transcribe_async("audio.wav", "en-US"))
# With format=detailed, the display text is in the NBest list
print(result["NBest"][0]["Display"])
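
The main benefit of an async client is overlapping several requests, for example a handful of short clips. The sketch below shows one way to cap concurrency with a semaphore; `transcribe_many` is a hypothetical helper, the limit of 3 is an arbitrary illustration rather than a documented quota, and the transcription coroutine is passed in so the pattern can be tried without network access.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def transcribe_many(
    paths: Sequence[str],
    transcribe: Callable[[str], Awaitable[dict]],
    limit: int = 3,  # assumption: a small cap to stay well within service limits
) -> list[dict]:
    """Run `transcribe` on each path with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> dict:
        async with semaphore:
            return await transcribe(path)

    # gather preserves input order, so results line up with `paths`.
    return list(await asyncio.gather(*(worker(p) for p in paths)))
```

With the function above, a call might look like `asyncio.run(transcribe_many(["a.wav", "b.wav"], transcribe_async))`. For large numbers of files, the Batch Transcription API remains the better fit.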

Supported Languages

Common language codes (see full list):

Code    Language
en-US   English (US)
en-GB   English (UK)
de-DE   German
fr-FR   French
es-ES   Spanish (Spain)
es-MX   Spanish (Mexico)
zh-CN   Chinese (Mandarin)
ja-JP   Japanese
ko-KR   Korean
pt-BR   Portuguese (Brazil)

Best Practices

Use WAV PCM 16kHz mono for best compatibility
Enable chunked transfer for lower latency
Cache access tokens for 9 minutes (valid for 10)
Specify the correct language for accurate recognition
Use detailed format when you need confidence scores
Handle all RecognitionStatus values in production code

When NOT to Use This API

Use the Speech SDK or Batch Transcription API instead when you need:

Audio longer than 60 seconds
Real-time streaming transcription
Partial/interim results
Speech translation
Custom speech models
Batch transcription of many files

Reference Files

File                                     Contents
references/pronunciation-assessment.md   Pronunciation assessment parameters and scoring
