ASR (Speech to Text) Skill
This skill guides the implementation of speech-to-text (ASR) functionality using the z-ai-web-dev-sdk package, enabling accurate transcription of spoken audio into text.
Skills Path
Skill Location: {project_path}/skills/ASR
This skill is located at the path above in your project.
Reference Scripts: Example test scripts are available in the {Skill Location}/scripts/ directory for quick testing and reference. See {Skill Location}/scripts/asr.ts for a working example.
Overview
Speech-to-Text (ASR - Automatic Speech Recognition) allows you to build applications that convert spoken language in audio files into written text, enabling voice-controlled interfaces, transcription services, and audio content analysis.
IMPORTANT: z-ai-web-dev-sdk MUST be used in backend code only. Never use it in client-side code.
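One way to make the backend-only rule harder to violate by accident is a small runtime guard. The sketch below is not part of z-ai-web-dev-sdk: `isServerSide` and `assertServerSide` are hypothetical helper names, and the check relies on the common heuristic that browsers define `window` and `document` while plain Node.js does not.

```javascript
// Hypothetical helpers (not part of z-ai-web-dev-sdk).
// Heuristic: browsers expose `window` and `document`; plain Node.js does not.
function isServerSide() {
  return typeof window === 'undefined' && typeof document === 'undefined';
}

function assertServerSide(moduleName) {
  if (!isServerSide()) {
    throw new Error(`${moduleName} must only be used in backend code`);
  }
}

// Call this before creating the SDK client (for example, before ZAI.create()).
assertServerSide('z-ai-web-dev-sdk');
```

A guard like this only catches mistakes at runtime; keeping SDK imports inside server-only modules remains the primary safeguard.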
Prerequisites
The z-ai-web-dev-sdk package is already installed. Import it as shown in the examples below.
CLI Usage (For Simple Tasks)
For simple audio transcription tasks, you can use the z-ai CLI instead of writing code. This is ideal for quick transcriptions, testing audio files, or batch processing.
Basic Transcription from File
Transcribe an audio file
z-ai asr --file ./audio.wav
Save transcription to JSON file
z-ai asr -f ./recording.mp3 -o transcript.json
Transcribe and save output using the long --output option
z-ai asr --file ./interview.wav --output result.json
Transcription from Base64
Transcribe from base64 encoded audio
z-ai asr --base64 "UklGRiQAAABXQVZFZm10..." -o result.json
Using short option
z-ai asr -b "base64_encoded_audio_data" -o transcript.json
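Before passing base64 data to the CLI or the SDK, it can help to confirm that the string really decodes to audio. The sketch below covers WAV only and is an illustration, not part of the z-ai tooling: `looksLikeWav` is a hypothetical helper that checks for the `RIFF` and `WAVE` markers that open a WAV file.

```javascript
// Hypothetical helper: checks whether base64 data starts with a RIFF/WAVE header.
// A WAV file begins with "RIFF" at byte 0 and "WAVE" at byte 8.
function looksLikeWav(base64Audio) {
  const bytes = Buffer.from(base64Audio, 'base64');
  if (bytes.length < 12) return false;
  const riff = bytes.toString('ascii', 0, 4);
  const wave = bytes.toString('ascii', 8, 12);
  return riff === 'RIFF' && wave === 'WAVE';
}

// The visible prefix of the sample string in the CLI example decodes to a RIFF/WAVE header.
console.log(looksLikeWav('UklGRiQAAABXQVZFZm10')); // true
console.log(looksLikeWav(Buffer.from('not audio at all').toString('base64'))); // false
```

Other formats such as MP3 use different signatures, so a `false` result here only means "not a WAV file", not "not audio".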
Streaming Output
Stream transcription results
z-ai asr -f ./audio.wav --stream
CLI Parameters
- --file, -f <path>: Required (if not using --base64) - Audio file path
- --base64, -b <base64>: Required (if not using --file) - Base64 encoded audio
- --output, -o <path>: Optional - Output file path (JSON format)
- --stream: Optional - Stream the transcription output
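When the CLI is driven from a script, the parameter rules above can be encoded in a small argument builder. This is a sketch under assumptions: `buildAsrArgs` is a hypothetical helper, it uses only the flags listed above, and it treats `--file` and `--base64` as mutually exclusive, which the parameter descriptions suggest but do not state outright.

```javascript
// Hypothetical helper: builds the argument list for `z-ai asr` from an options object.
function buildAsrArgs({ file, base64, output, stream = false } = {}) {
  const hasFile = typeof file === 'string' && file.length > 0;
  const hasBase64 = typeof base64 === 'string' && base64.length > 0;

  // Assumption: exactly one audio source is given, either --file or --base64.
  if (hasFile === hasBase64) {
    throw new Error('Provide exactly one of "file" or "base64"');
  }

  const args = ['asr'];
  if (hasFile) {
    args.push('--file', file);
  } else {
    args.push('--base64', base64);
  }
  if (output) args.push('--output', output);
  if (stream) args.push('--stream');
  return args;
}

console.log(buildAsrArgs({ file: './audio.wav', output: 'transcript.json' }).join(' '));
// asr --file ./audio.wav --output transcript.json
```

The returned array can be handed to Node's `child_process.execFile('z-ai', args, callback)`, which avoids shell quoting problems with long base64 strings.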
Supported Audio Formats
The ASR service supports various audio formats including:
- WAV (.wav)
- MP3 (.mp3)
- Other common audio formats (the examples in this skill also accept M4A, FLAC, and OGG)
When to Use CLI vs SDK
Use CLI for:
- Quick audio file transcriptions
- Testing audio recognition accuracy
- Simple batch processing scripts
- One-off transcription tasks
Use SDK for:
- Real-time audio transcription in applications
- Integration with recording systems
- Custom audio processing workflows
- Production applications with streaming audio
Basic ASR Implementation
Simple Audio Transcription
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function transcribeAudio(audioFilePath) {
  const zai = await ZAI.create();

  // Read audio file and convert to base64
  const audioFile = fs.readFileSync(audioFilePath);
  const base64Audio = audioFile.toString('base64');

  const response = await zai.audio.asr.create({
    file_base64: base64Audio
  });

  return response.text;
}

// Usage
const transcription = await transcribeAudio('./audio.wav');
console.log('Transcription:', transcription);
Transcribe Multiple Audio Files
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function transcribeBatch(audioFilePaths) {
  const zai = await ZAI.create();
  const results = [];

  // Process files one at a time; a failure on one file does not stop the batch
  for (const filePath of audioFilePaths) {
    try {
      const audioFile = fs.readFileSync(filePath);
      const base64Audio = audioFile.toString('base64');

      const response = await zai.audio.asr.create({
        file_base64: base64Audio
      });

      results.push({
        file: filePath,
        success: true,
        transcription: response.text
      });
    } catch (error) {
      results.push({
        file: filePath,
        success: false,
        error: error.message
      });
    }
  }

  return results;
}

// Usage
const files = ['./interview1.wav', './interview2.wav', './interview3.wav'];
const transcriptions = await transcribeBatch(files);

transcriptions.forEach(result => {
  if (result.success) {
    console.log(`${result.file}: ${result.transcription}`);
  } else {
    console.error(`${result.file}: Error - ${result.error}`);
  }
});
Advanced Use Cases
Audio File Processing with Metadata
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';

async function transcribeWithMetadata(audioFilePath) {
  const zai = await ZAI.create();

  // Get file metadata
  const stats = fs.statSync(audioFilePath);
  const audioFile = fs.readFileSync(audioFilePath);
  const base64Audio = audioFile.toString('base64');

  // Measure how long the transcription request takes
  const startTime = Date.now();

  const response = await zai.audio.asr.create({
    file_base64: base64Audio
  });

  const endTime = Date.now();

  return {
    filename: path.basename(audioFilePath),
    filepath: audioFilePath,
    fileSize: stats.size,
    transcription: response.text,
    wordCount: response.text.split(/\s+/).length,
    processingTime: endTime - startTime,
    timestamp: new Date().toISOString()
  };
}

// Usage
const result = await transcribeWithMetadata('./meeting_recording.wav');
console.log('Transcription Details:', JSON.stringify(result, null, 2));
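The `wordCount` expression above reports 1 for an empty transcription, because splitting an empty string still yields one (empty) element. If silent recordings are possible, a small helper avoids that. `countWords` is a hypothetical name introduced for this sketch.

```javascript
// Hypothetical helper: counts whitespace-separated words, returning 0 for empty text.
function countWords(text) {
  const trimmed = (text ?? '').trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

console.log(''.split(/\s+/).length);          // 1 (the naive count for an empty string)
console.log(countWords(''));                  // 0
console.log(countWords('  hello   world '));  // 2
```

Whitespace splitting is only a rough measure; it will not count words correctly for languages written without spaces.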
Real-time Audio Processing Service
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import crypto from 'crypto';

class ASRService {
  constructor() {
    this.zai = null;
    this.transcriptionCache = new Map();
  }

  // Must be called once before transcribe()
  async initialize() {
    this.zai = await ZAI.create();
  }

  // Hash the audio content so identical files share one cache entry
  generateCacheKey(audioBuffer) {
    return crypto.createHash('md5').update(audioBuffer).digest('hex');
  }

  async transcribe(audioFilePath, useCache = true) {
    const audioBuffer = fs.readFileSync(audioFilePath);
    const cacheKey = this.generateCacheKey(audioBuffer);

    // Check cache
    if (useCache && this.transcriptionCache.has(cacheKey)) {
      return {
        transcription: this.transcriptionCache.get(cacheKey),
        cached: true
      };
    }

    // Transcribe audio
    const base64Audio = audioBuffer.toString('base64');
    const response = await this.zai.audio.asr.create({
      file_base64: base64Audio
    });

    // Cache result
    if (useCache) {
      this.transcriptionCache.set(cacheKey, response.text);
    }

    return {
      transcription: response.text,
      cached: false
    };
  }

  clearCache() {
    this.transcriptionCache.clear();
  }

  getCacheSize() {
    return this.transcriptionCache.size;
  }
}

// Usage
const asrService = new ASRService();
await asrService.initialize();

const result1 = await asrService.transcribe('./audio.wav');
console.log('First call (not cached):', result1);

const result2 = await asrService.transcribe('./audio.wav');
console.log('Second call (cached):', result2);
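The `transcriptionCache` in the service above is a plain `Map`, so it grows without limit in a long-running process. One possible refinement is a size-bounded cache with least-recently-used eviction. The sketch below is a suggestion rather than part of the skill: `BoundedCache` and its `maxEntries` option are hypothetical names, and it relies on `Map` preserving insertion order.

```javascript
// Hypothetical size-bounded cache with least-recently-used (LRU) eviction.
// Relies on Map iterating keys in insertion order.
class BoundedCache {
  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  has(key) {
    return this.entries.has(key);
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    // Evict the least recently used entry once the limit is exceeded
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  clear() {
    this.entries.clear();
  }

  get size() {
    return this.entries.size;
  }
}
```

Because it exposes `has`, `get`, `set`, `clear`, and `size`, an instance could stand in for the `Map` assigned in the `ASRService` constructor without changing the other methods.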
Directory Transcription
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';

async function transcribeDirectory(directoryPath, outputJsonPath) {
  const zai = await ZAI.create();

  // Get all audio files (match on the file extension, case-insensitive)
  const files = fs.readdirSync(directoryPath);
  const audioFiles = files.filter(file =>
    /\.(wav|mp3|m4a|flac|ogg)$/i.test(file)
  );

  const results = {
    directory: directoryPath,
    totalFiles: audioFiles.length,
    processedAt: new Date().toISOString(),
    transcriptions: []
  };

  for (const filename of audioFiles) {
    const filePath = path.join(directoryPath, filename);

    try {
      const audioFile = fs.readFileSync(filePath);
      const base64Audio = audioFile.toString('base64');

      const response = await zai.audio.asr.create({
        file_base64: base64Audio
      });

      results.transcriptions.push({
        filename: filename,
        success: true,
        text: response.text,
        wordCount: response.text.split(/\s+/).length
      });

      console.log(`✓ Transcribed: ${filename}`);
    } catch (error) {
      results.transcriptions.push({
        filename: filename,
        success: false,
        error: error.message
      });

      console.error(`✗ Failed: ${filename} - ${error.message}`);
    }
  }

  // Save results to JSON
  fs.writeFileSync(
    outputJsonPath,
    JSON.stringify(results, null, 2)
  );

  return results;
}

// Usage
const results = await transcribeDirectory(
  './audio-recordings',
  './transcriptions.json'
);

console.log(`\nProcessed ${results.totalFiles} files`);
console.log(`Successful: ${results.transcriptions.filter(t => t.success).length}`);
console.log(`Failed: ${results.transcriptions.filter(t => !t.success).length}`);
Best Practices
- Audio Format Handling
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function transcribeAnyFormat(audioFilePath) {
  // Supported formats: WAV, MP3, M4A, FLAC, OGG, etc.
  const validExtensions = ['.wav', '.mp3', '.m4a', '.flac', '.ogg'];
  const ext = audioFilePath.toLowerCase().substring(audioFilePath.lastIndexOf('.'));

  if (!validExtensions.includes(ext)) {
    throw new Error(`Unsupported audio format: ${ext}`);
  }

  const zai = await ZAI.create();
  const audioFile = fs.readFileSync(audioFilePath);
  const base64Audio = audioFile.toString('base64');

  const response = await zai.audio.asr.create({
    file_base64: base64Audio
  });

  return response.text;
}
- Error Handling
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function safeTranscribe(audioFilePath) {
  try {
    // Validate file exists
    if (!fs.existsSync(audioFilePath)) {
      throw new Error(`File not found: ${audioFilePath}`);
    }

    // Check file size (e.g., limit to 100MB)
    const stats = fs.statSync(audioFilePath);
    const fileSizeMB = stats.size / (1024 * 1024);

    if (fileSizeMB > 100) {
      throw new Error(`File too large: ${fileSizeMB.toFixed(2)}MB (max 100MB)`);
    }

    // Transcribe
    const zai = await ZAI.create();
    const audioFile = fs.readFileSync(audioFilePath);
    const base64Audio = audioFile.toString('base64');

    const response = await zai.audio.asr.create({
      file_base64: base64Audio
    });

    if (!response.text || response.text.trim().length === 0) {
      throw new Error('Empty transcription result');
    }

    return {
      success: true,
      transcription: response.text,
      filePath: audioFilePath,
      fileSize: stats.size
    };
  } catch (error) {
    console.error('Transcription error:', error);
    return {
      success: false,
      error: error.message,
      filePath: audioFilePath
    };
  }
}
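If some failures turn out to be temporary (a network hiccup, for example), a retry wrapper with exponential backoff is one common way to handle them. Whether ASR errors are safe to retry is an assumption here; the skill does not say. `withRetry`, `retries`, and `baseDelayMs` are hypothetical names for this sketch.

```javascript
// Hypothetical helper: retries an async operation with exponential backoff.
// Delay before retry n (0-based) is baseDelayMs * 2^n.
async function withRetry(operation, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // no attempts left
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage sketch (assumes `zai` and `base64Audio` are prepared as in the examples above):
// const response = await withRetry(
//   () => zai.audio.asr.create({ file_base64: base64Audio }),
//   { retries: 2, baseDelayMs: 1000 }
// );
```

Errors that will never succeed on retry, such as a missing file or an unsupported format, are better checked before the call so they are not retried.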
- Post-Processing Transcriptions
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

function cleanTranscription(text) {
  // Remove filler words (optional); done first so later steps see the final text
  const fillers = ['um', 'uh', 'ah', 'like', 'you know'];
  const fillerPattern = new RegExp(`\\b(${fillers.join('|')})\\b`, 'gi');
  text = text.replace(fillerPattern, '');

  // Remove excessive whitespace
  text = text.replace(/\s+/g, ' ').trim();

  // Capitalize first letter of sentences
  text = text.replace(/(^\w|[.!?]\s+\w)/g, match => match.toUpperCase());

  return text;
}

async function transcribeAndClean(audioFilePath) {
  const zai = await ZAI.create();

  const audioFile = fs.readFileSync(audioFilePath);
  const base64Audio = audioFile.toString('base64');

  const response = await zai.audio.asr.create({
    file_base64: base64Audio
  });

  // Return both versions so callers can compare or fall back to the raw text
  return {
    raw: response.text,
    cleaned: cleanTranscription(response.text)
  };
}
Common Use Cases
- Meeting Transcription: Convert recorded meetings into searchable text
- Interview Processing: Transcribe interviews for analysis and documentation
- Podcast Transcription: Create text versions of podcast episodes
- Voice Notes: Convert voice memos to text for easier reference
- Call Center Analytics: Analyze customer service calls
- Accessibility: Provide text alternatives for audio content
- Voice Commands: Enable voice-controlled applications
- Language Learning: Transcribe pronunciation practice
Integration Examples
Express.js API Endpoint
import express from 'express';
import multer from 'multer';
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

const app = express();
const upload = multer({ dest: 'uploads/' });

// Shared SDK instance, created once at startup
let zaiInstance;

async function initZAI() {
  zaiInstance = await ZAI.create();
}

app.post('/api/transcribe', upload.single('audio'), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: 'No audio file provided' });
    }

    const audioFile = fs.readFileSync(req.file.path);
    const base64Audio = audioFile.toString('base64');

    const response = await zaiInstance.audio.asr.create({
      file_base64: base64Audio
    });

    // Clean up uploaded file
    fs.unlinkSync(req.file.path);

    res.json({
      success: true,
      transcription: response.text,
      wordCount: response.text.split(/\s+/).length
    });
  } catch (error) {
    // Clean up on error
    if (req.file && fs.existsSync(req.file.path)) {
      fs.unlinkSync(req.file.path);
    }

    res.status(500).json({
      success: false,
      error: error.message
    });
  }
});

initZAI().then(() => {
  app.listen(3000, () => {
    console.log('ASR API running on port 3000');
  });
});
Troubleshooting
Issue: "SDK must be used in backend"
- Solution: Ensure z-ai-web-dev-sdk is only imported in server-side code
Issue: Empty or incorrect transcription
- Solution: Verify audio quality and format. Check if audio contains clear speech
Issue: Large file processing fails
- Solution: Consider splitting large audio files into smaller segments
Issue: Slow transcription speed
- Solution: Implement caching for repeated transcriptions, optimize file sizes
Issue: Memory errors with large files
- Solution: Process files in chunks or increase Node.js memory limit
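Part of the memory pressure with large files comes from base64 itself: every 3 bytes of audio become 4 characters, so the encoded string is about one third larger than the file, and it is held in memory alongside the original buffer. The sketch below estimates the encoded size up front. `base64Length` is a hypothetical helper name; the formula is the standard one for padded base64.

```javascript
// Hypothetical helper: length in characters of the padded base64 encoding of n bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Cross-check the formula against Node's own encoder for a small buffer
console.log(Buffer.alloc(1000).toString('base64').length === base64Length(1000)); // true

// A 100MB file expands to roughly 133MB of base64 text
console.log(base64Length(100 * 1024 * 1024)); // 139810136
```

An estimate like this can feed the decision to split a recording into segments before reading it into memory.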
Performance Tips
- Reuse SDK Instance: Create once, use multiple times
- Implement Caching: Cache transcriptions for duplicate files
- Batch Processing: Process multiple files efficiently with proper queuing
- Audio Optimization: Compress audio files before processing when possible
- Async Operations: Use Promise.all for parallel processing when appropriate
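Calling `Promise.all` over every file at once starts all requests simultaneously, which can exhaust memory or overload the service. A bounded worker pool is one way to combine the queuing and parallelism tips above. This is a sketch, not part of the SDK: `mapWithConcurrency` is a hypothetical helper, and a sensible `limit` depends on service limits that this skill does not document.

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order; failures are recorded instead of aborting the batch.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function runWorker() {
    // Each worker keeps pulling the next unprocessed index until none remain
    while (nextIndex < items.length) {
      const index = nextIndex++;
      try {
        results[index] = { success: true, value: await worker(items[index], index) };
      } catch (error) {
        results[index] = { success: false, error: error.message };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results;
}

// Usage sketch: `transcribeAudio` is the function from "Simple Audio Transcription".
// const results = await mapWithConcurrency(files, 3, file => transcribeAudio(file));
```

Note that `transcribeAudio` creates a new SDK instance per call; in line with the first tip, a production version would share one instance across workers.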
Audio Quality Guidelines
For best transcription results:
- Sample Rate: 16kHz or higher
- Format: WAV, MP3, or M4A recommended
- Noise Level: Minimize background noise
- Speech Clarity: Clear pronunciation and normal speaking pace
- File Size: Under 100MB recommended for individual files
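For WAV files, the sample-rate guideline can be checked programmatically by reading the `fmt ` chunk of the RIFF header. The sketch below handles WAV only and is an illustration: `readWavSampleRate` is a hypothetical helper, and other formats such as MP3 or M4A need a proper audio metadata parser.

```javascript
// Hypothetical helper: returns the sample rate (Hz) from a WAV buffer, or null if not found.
function readWavSampleRate(buffer) {
  if (
    buffer.length < 12 ||
    buffer.toString('ascii', 0, 4) !== 'RIFF' ||
    buffer.toString('ascii', 8, 12) !== 'WAVE'
  ) {
    return null;
  }

  // Walk the RIFF chunks: 4-byte id, 4-byte little-endian size, then the chunk data
  let offset = 12;
  while (offset + 8 <= buffer.length) {
    const chunkId = buffer.toString('ascii', offset, offset + 4);
    const chunkSize = buffer.readUInt32LE(offset + 4);
    if (chunkId === 'fmt ') {
      // fmt data layout: format tag (2 bytes), channels (2 bytes), sample rate (4 bytes)
      if (offset + 16 > buffer.length) return null;
      return buffer.readUInt32LE(offset + 12);
    }
    // Chunks are padded to an even number of bytes
    offset += 8 + chunkSize + (chunkSize % 2);
  }
  return null;
}

// Usage sketch: only the start of the file is needed in the common case
// const sampleRate = readWavSampleRate(fs.readFileSync('./audio.wav'));
// if (sampleRate !== null && sampleRate < 16000) console.warn('Sample rate below 16kHz');
```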
Remember
- Always use z-ai-web-dev-sdk in backend code only
- The SDK is already installed - import as shown in examples
- Audio files must be converted to base64 before processing
- Implement proper error handling for production applications
- Consider audio quality for best transcription accuracy
- Clean up temporary files after processing
- Cache results for frequently transcribed files