funasr-transcribe

This skill should be used when the user explicitly asks to "transcribe a meeting", "transcribe audio", "transcribe a meeting recording", "convert audio to text", "generate meeting minutes from audio", "do speech-to-text", "transcribe with speaker diarization", "identify speakers in audio", "transcribe Chinese audio", "transcribe English audio", "transcribe Japanese audio", "multi-speaker transcription", "transcribe a podcast", "transcribe podcast episode", "transcribe an interview", "convert podcast to text", "podcast to transcript", or mentions FunASR, Paraformer, SenseVoice, Whisper, meeting transcription, podcast transcription, or speaker diarization. Supports multi-speaker meeting and podcast transcription in Chinese, English, Japanese, Korean, Cantonese, and 99 languages (via Whisper) with automatic speaker diarization and hotword biasing. Works on both GPU and CPU.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install the skill with:

```shell
npx skills add zxkane/zxkane-audio-transcriber-funasr
```

FunASR Meeting & Podcast Transcription

Transcribe multi-speaker audio into structured Markdown with automatic speaker diarization, hotword biasing, and optional LLM cleanup.

All scripts run directly from the plugin directory — no copying needed. Define this shorthand at the start of every session:

```shell
SCRIPTS=${CLAUDE_PLUGIN_ROOT}/skills/funasr-transcribe/scripts
```

Supported Languages

| `--lang` | Model | Languages | Hotword |
|---|---|---|---|
| `zh` (default) | SeACo-Paraformer | Chinese (CER 1.95%) | Yes |
| `zh-basic` | Paraformer-large | Chinese | No |
| `en` | Paraformer-en | English | No |
| `auto` | SenseVoiceSmall | Auto-detect: zh/en/ja/ko/yue | No |
| `whisper` | Whisper-large-v3-turbo | 99 languages | No |

All presets include speaker diarization (CAM++) and VAD (FSMN).

Diarization caveat: auto and whisper do not output per-sentence timestamps, so speaker diarization does not work with these presets. Use zh, zh-basic, or en when speaker identification is needed (e.g., podcasts, meetings).

Workflow

Before starting transcription, always ask the user:

  1. Audio file — path to the recording (required)
  2. Type — meeting, podcast, or interview? (affects defaults)
  3. Language — what language is spoken? (default: Chinese)
  4. Number of speakers — how many participants? (improves diarization)
  5. Speaker names — for podcasts: host + guest names; for meetings: attendee list
  6. Supporting files — ask:

    "Do you have any of the following to improve accuracy?"

    • Attendee / guest list — for hotwords and speaker mapping
    • Meeting agenda or episode topic — for hotwords (terms, names)
    • Reference documents (show notes, prior notes) — for speaker identification and ASR correction

Adapt defaults by recording type:

  • Meeting: default --lang zh, ask about supporting files
  • Podcast / interview: default --lang zh, --num-speakers 2, always ask for host + guest names, suggest --speaker-context for roles (do NOT use --lang auto — it lacks timestamps for speaker diarization)

⚠️ --speakers must use the speaker's real name, not a podcast alias. The value passed to --speakers is used verbatim as the speaker label in the output transcript. Always derive it from the host/guest's actual name (e.g. from a shownotes "Host:" field), not from the podcast feed name or title.

Example: if shownotes lists "Host: 张三(张三的播客)", pass --speakers '张三' — not the alias "张三的播客". Add both the real name and the alias to hotwords.txt so ASR can recognise both forms.

When both --speakers and --reference are supplied, the script checks for this alias mistake at startup and prints an ACTION REQUIRED block naming the suggested real name. If you see that block, stop the run and re-invoke with the corrected --speakers value before Phase 3; the warning itself does not abort the pipeline.

If the user provides supporting materials:

  • Extract participant names and key terms → create hotwords.txt (include both real name and alias)
  • Extract per-person context → create speaker-context.json
  • Pass original reference document with --reference
  • Use all three together for best results
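The three supporting files can be sketched as follows. This is a minimal sketch: the speaker-context.json layout shown here is an assumption for illustration, so check scripts/transcribe_funasr.py for the exact schema it reads.

```shell
# Sketch: build supporting files from an extracted guest list and episode topic.
# Include both the real name and the podcast alias in hotwords.txt.
cat > hotwords.txt <<'EOF'
张三
张三的播客
Kubernetes
服务网格
EOF

# Per-speaker context for LLM cleanup (layout is illustrative, not the real schema).
cat > speaker-context.json <<'EOF'
{
  "张三": "Host. Opens the show and asks the questions.",
  "李四": "Guest. Cloud infrastructure engineer; talks about Kubernetes."
}
EOF
```

Pass them together with the original document: `--hotwords hotwords.txt --speaker-context speaker-context.json --reference show-notes.md`.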

Quick Start

1. Environment Setup

```shell
AUTO_YES=1 bash $SCRIPTS/setup_env.sh
# Or force CPU:  AUTO_YES=1 bash $SCRIPTS/setup_env.sh cpu
```

The setup script patches FunASR's spectral clustering for O(N²·k) performance. Without this, recordings over ~1 hour hang for hours during speaker clustering.

2. Run Transcription

Output files are written to the current working directory.

LLM cleanup (Phase 3) is opt-in. By default, transcription runs locally without contacting any external service. To enable LLM-powered ASR correction and speaker name refinement, pass --model <model-id>. Use LLM cleanup when:

  • The raw transcript has many ASR errors (names, technical terms)
  • You need polished, publication-ready output
  • Speaker names need to be refined from context

⚠️ Data Privacy: When LLM cleanup is enabled via --model, transcript excerpts are sent to external LLM providers (AWS Bedrock, Anthropic, or OpenAI depending on the model ID). Use --skip-llm or omit --model to keep all data local. For Bedrock, boto3 uses the standard AWS credential chain (IAM role, SSO, ~/.aws/credentials, env vars).

```shell
# Chinese meeting with hotwords (local-only, no LLM)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang zh --num-speakers 9 --hotwords hotwords.txt

# English meeting with speaker names
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang en --speakers "Alice,Bob,Carol,Dave"

# Auto-detect language (zh/en/ja/ko/yue)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang auto --num-speakers 6

# Whisper for any language
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang whisper --num-speakers 4

# Enable LLM cleanup for polished output (requires --model)
# Bedrock (uses AWS credential chain: IAM role, SSO, ~/.aws/credentials)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang zh --num-speakers 9 --hotwords hotwords.txt \
    --model us.anthropic.claude-sonnet-4-6

# Anthropic API (requires ANTHROPIC_API_KEY env var)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --model claude-sonnet-4-6

# OpenAI-compatible API (requires OPENAI_API_KEY env var)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --model gpt-4o

# Full pipeline with all supporting files + LLM (best quality)
python3 $SCRIPTS/transcribe_funasr.py episode.m4a \
    --lang zh --num-speakers 2 \
    --hotwords hotwords.txt \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json \
    --reference show-notes.md \
    --model us.anthropic.claude-sonnet-4-6

# Resume interrupted LLM cleanup
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --skip-transcribe --model us.anthropic.claude-sonnet-4-6
```

3. Verify Speaker Labels

If the transcript has swapped speaker labels (common with podcasts), the verification script can detect and fix mismatches using LLM analysis:

```shell
# Dry-run: check if host/guest are swapped
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json

# Apply the fix
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json --fix

# Multi-speaker meeting: full reassignment
python3 $SCRIPTS/verify_speakers.py meeting_raw_transcript.json \
    --speakers "Alice,Bob,Carol,Dave" \
    --speaker-context speaker-context.json --fix

# Then regenerate the markdown with corrected labels
python3 $SCRIPTS/transcribe_funasr.py original.m4a \
    --skip-transcribe --clean-cache
```

The script analyzes the first 5 minutes (configurable with --minutes) and auto-detects podcast (2 speakers, swap detection) vs meeting (N speakers, full reassignment).

Audio Preprocessing

The script automatically converts input audio to 16kHz mono FLAC and validates that no audio is lost (detects silent truncation).

| Format | Size (4h14m meeting) | Quality | Recommendation |
|---|---|---|---|
| FLAC | 219 MB | Lossless | Default, safest |
| Opus | 55 MB | Lossy | Risk of truncation on long files |
| WAV | 465 MB | Lossless | Works but larger |
| Original M4A | 173 MB | Source | Also works directly |

Do NOT split long recordings — splitting breaks speaker ID consistency.
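The silent-truncation check described above can be reproduced by hand. A minimal sketch, assuming ffmpeg/ffprobe are installed; the durations below are hard-coded stand-ins for the values ffprobe would print:

```shell
# Manual equivalent of the built-in preprocessing step (assumes ffmpeg on PATH):
#   ffmpeg -i meeting.m4a -ac 1 -ar 16000 meeting.flac
# Compare source vs converted duration; a large gap means silent truncation.
orig=15240.31   # seconds, e.g. from: ffprobe -v error -show_entries format=duration -of csv=p=0 meeting.m4a
conv=15240.18   # same probe run against meeting.flac
diff=$(awk -v a="$orig" -v b="$conv" 'BEGIN { d = a - b; if (d < 0) d = -d; print d }')
awk -v d="$diff" 'BEGIN { exit (d < 1.0) ? 0 : 1 }' && echo "no truncation detected"
```

A sub-second difference is normal encoder padding; a gap of seconds or minutes means the converted file lost audio.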

Key Flags

| Flag | Purpose |
|---|---|
| `--lang` | `zh` (default), `zh-basic`, `en`, `auto`, `whisper` |
| `--hotwords` | Hotword file or string; biases ASR (`zh` only) |
| `--reference F` | Reference file for LLM ASR correction |
| `--num-speakers N` | Expected speaker count (improves diarization) |
| `--speakers "A,B,C"` | Assign real names by first-appearance order |
| `--speaker-context F` | JSON with per-speaker roles for LLM |
| `--no-detect-gender` | Disable automatic speaker gender detection (CAM++ gender classifier) |
| `--speaker-genders "A:female,B:male"` | Override per-speaker gender (also accepts positional `female,male`) |
| `--audio-format` | `flac` (default), `opus`, `wav` |
| `--device cpu` | Force CPU mode |
| `--batch-size N` | Adjust for memory (60 for CPU, 100 if GPU OOM) |
| `--phase1-only` | Exit after Phase 1 (VAD + ASR + diarization); skip Phases 2 and 3 |
| `--json-out PATH` | Write raw transcript JSON to an explicit path (overrides default naming) |
| `--skip-transcribe` | Resume from saved `*_raw_transcript.json` |
| `--skip-llm` | Skip LLM cleanup (default when `--model` is omitted) |
| `--model ID` | Enable LLM cleanup with this model (auto-detects Bedrock/Anthropic/OpenAI) |
| `--title "..."` | Output document title |
| `--clean-cache` | Delete LLM chunk cache after completion |
| `--output PATH` | Custom output file path |
| `--model-cache-dir` | ModelScope model cache directory (~3 GB; default `~/.cache/modelscope/`) |

Outputs

  • <stem>-transcript.md — Final Markdown with speaker labels and timestamps
  • <stem>_raw_transcript.json — Raw Phase 1 output (for resume/analysis)

Speaker Diarization Tips

FunASR's CAM++ may merge acoustically similar speakers. To improve:

  1. --num-speakers N — Hint expected count
  2. --hotwords — Include participant names (Chinese names work best)
  3. --speaker-context — Provide per-person keywords for LLM splitting
  4. Keyword matching — Search *_raw_transcript.json for unique phrases
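Tip 4 can be sketched like this. The JSON layout below is a stand-in (an assumption), so inspect your own *_raw_transcript.json for the real field names before adapting the pattern:

```shell
# Create a stand-in raw transcript (layout assumed for illustration only).
cat > sample_raw_transcript.json <<'EOF'
[
  {"speaker": "SPK0", "start": 12.4, "text": "我们先看季度预测"},
  {"speaker": "SPK1", "start": 30.1, "text": "好的,我补充两点"}
]
EOF
# Which diarized speaker ID uttered a phrase only one participant would say?
grep -o '"speaker": "[^"]*".*季度预测' sample_raw_transcript.json
```

Once you know which speaker ID matches which person, order the names accordingly in `--speakers`.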

Speaker gender

Enabled by default: each detected speaker is classified as male / female via 3D-Speaker's CAM++ gender classifier (iic/speech_campplus_two_class_gender_16k). The result appears next to each name in the Speaker List table and is injected into the LLM cleanup prompt so pronouns (他/她, he/she) get corrected.

Precedence when combined:

  1. --speaker-genders "Alice:female,Bob:male" (explicit CLI) — always wins
  2. Reference text hints like 主播(女):韩梅梅 or Host (male): Alice — override auto
  3. CAM++ auto-detection — fallback

Disable with --no-detect-gender if you don't need gender and want to save the ~500 MB model download and extra inference time.
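The precedence rules above can be sketched as plain logic. This is a model of the documented behaviour, not the script's actual code:

```shell
python3 - <<'EOF'
# Model of the documented gender-precedence rules (illustrative, not the real implementation).
def resolve_gender(name, cli=None, reference=None, auto=None):
    if cli and name in cli:              # 1. explicit --speaker-genders always wins
        return cli[name]
    if reference and name in reference:  # 2. reference-text hints override auto-detection
        return reference[name]
    return auto.get(name) if auto else None  # 3. CAM++ auto-detection as fallback

print(resolve_gender("Alice", cli={"Alice": "female"}, auto={"Alice": "male"}))      # female
print(resolve_gender("Bob", reference={"Bob": "male"}, auto={"Bob": "female"}))      # male
EOF
```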

CPU-only / Low-Memory Machines

Long recordings on resource-constrained machines may hit exec timeouts or OOM kills. See references/pipeline-details.md for workarounds:

  • Detach from agent timeouts with systemd-run or nohup
  • Prevent OOM via swap and/or --lang zh-basic (lighter model)
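The detach pattern can be sketched with nohup. Here `sleep 2` is a stand-in for the real transcription command, which this sketch does not run:

```shell
# Detach a long run from the agent's exec timeout, then poll the log file.
# 'sleep 2' stands in for something like:
#   python3 $SCRIPTS/transcribe_funasr.py meeting.wav --lang zh-basic --device cpu --batch-size 60
nohup sleep 2 > transcribe.log 2>&1 &
pid=$!
echo "detached as PID $pid"
wait "$pid"            # in practice: check the log periodically instead of blocking
echo "run finished"
```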

Additional Resources

  • references/pipeline-details.md — Architecture, model specs, benchmarks, speaker role verification, hotword effectiveness, clustering patch
  • scripts/transcribe_funasr.py — Main transcription pipeline
  • scripts/verify_speakers.py — Speaker label verification & fix
  • scripts/llm_utils.py — Shared LLM infrastructure (Bedrock/Anthropic/OpenAI)
  • scripts/setup_env.sh — Environment setup (venv + deps + patch)
