audiocraft-audio-generation

AudioCraft: Audio Generation

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "audiocraft-audio-generation" with this command: npx skills add orchestra-research/ai-research-skills/orchestra-research-ai-research-skills-audiocraft-audio-generation


Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

When to use AudioCraft

Use AudioCraft when:

  • Generating music from text descriptions

  • Creating sound effects and environmental audio

  • Building music generation applications

  • Generating music conditioned on a reference melody

  • Producing stereo audio output

  • Controlling generation with a style reference (MusicGen-Style)

Key features:

  • MusicGen: Text-to-music generation with melody conditioning

  • AudioGen: Text-to-sound effects generation

  • EnCodec: High-fidelity neural audio codec

  • Multiple model sizes: Small (300M) to Large (3.3B)

  • Stereo support: Full stereo audio generation

  • Style conditioning: MusicGen-Style for reference-based generation

Use alternatives instead:

  • Stable Audio: For longer commercial music generation

  • Bark: For text-to-speech with music/sound effects

  • Riffusion: For spectrogram-based music generation

  • OpenAI Jukebox: For raw audio generation with lyrics

Quick start

Installation

From PyPI

pip install audiocraft

From GitHub (latest)

pip install git+https://github.com/facebookresearch/audiocraft.git

Or use HuggingFace Transformers

pip install transformers torch torchaudio

Basic text-to-music (AudioCraft)

import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,  # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)

Using HuggingFace Transformers

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
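In the Transformers example, `max_new_tokens` counts codec frames rather than seconds. The sketch below shows the conversion, assuming the 50 Hz frame rate of MusicGen's 32 kHz EnCodec tokenizer (hard-coded here as an assumption; Transformers exposes it as `model.config.audio_encoder.frame_rate`). The helper names are hypothetical and not part of either library.

```python
import math

FRAME_RATE_HZ = 50  # assumed frame rate of the 32 kHz EnCodec tokenizer


def tokens_for_duration(seconds, frame_rate=FRAME_RATE_HZ):
    """Number of new tokens needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate)


def duration_for_tokens(max_new_tokens, frame_rate=FRAME_RATE_HZ):
    """Approximate audio length in seconds produced by `max_new_tokens`."""
    return max_new_tokens / frame_rate


print(duration_for_tokens(256))  # 5.12
print(tokens_for_duration(8))    # 400
```

Under that assumption, `max_new_tokens=256` yields roughly five seconds of audio, and an 8-second clip needs about 400 tokens.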

Text-to-sound with AudioGen

import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)

Core concepts

Architecture overview

AudioCraft Architecture:

┌──────────────────────────────────────────────────────────────┐
│                Text Encoder (T5)                             │
│                        │                                     │
│                 Text Embeddings                              │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│            Transformer Decoder (LM)                          │
│    Auto-regressively generates audio tokens                  │
│    Using efficient token interleaving patterns               │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│              EnCodec Audio Decoder                           │
│     Converts tokens back to audio waveform                   │
└──────────────────────────────────────────────────────────────┘
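The "token interleaving patterns" in the diagram describe how several parallel codebook streams are laid out for a single autoregressive decoder. One such layout is a "delay" pattern, in which codebook k is shifted right by k steps. The sketch below illustrates that idea in plain Python under that assumption; the function names and the `PAD` marker are hypothetical, and this is not AudioCraft's implementation.

```python
PAD = None  # placeholder for positions that hold no token


def delay_interleave(codes):
    """Shift codebook k right by k steps.

    codes: list of K lists, each holding T tokens (one list per codebook).
    Returns K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    rows = []
    for k, row in enumerate(codes):
        left = [PAD] * k
        right = [PAD] * (num_codebooks - 1 - k)
        rows.append(left + list(row) + right)
    return rows


def delay_deinterleave(rows):
    """Undo delay_interleave and recover the K x T token grid."""
    num_codebooks = len(rows)
    num_frames = len(rows[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(rows)]


codes = [[f"c{k}t{t}" for t in range(3)] for k in range(4)]
for row in delay_interleave(codes):
    print(row)
```

With this layout, each decoding step emits one token per codebook, and the K codebooks for T frames take T + K - 1 steps rather than T × K.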

Model variants

Model                  Size    Description     Use Case
musicgen-small         300M    Text-to-music   Quick generation
musicgen-medium        1.5B    Text-to-music   Balanced
musicgen-large         3.3B    Text-to-music   Best quality
musicgen-melody        1.5B    Text + melody   Melody conditioning
musicgen-melody-large  3.3B    Text + melody   Best melody
musicgen-stereo-*      Varies  Stereo output   Stereo generation
musicgen-style         1.5B    Style transfer  Reference-based
audiogen-medium        1.5B    Text-to-sound   Sound effects

Generation parameters

Parameter    Default  Description
duration     8.0      Length in seconds (1-120)
top_k        250      Top-k sampling
top_p        0.0      Nucleus sampling (0 = disabled)
temperature  1.0      Sampling temperature
cfg_coef     3.0      Classifier-free guidance
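To make the interaction between `cfg_coef`, `temperature`, and `top_k` concrete, here is a minimal pure-Python sketch of one sampling step. It assumes the usual classifier-free guidance formulation (unconditional logits plus a scaled difference) followed by temperature scaling and top-k filtering. All function names are hypothetical, the toy logits are made up, and nucleus sampling (`top_p`) is left out.

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, cfg_coef):
    """Classifier-free guidance: move away from the unconditional prediction."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def top_k_probs(logits, top_k, temperature):
    """Softmax over the top_k highest temperature-scaled logits."""
    scaled = [value / temperature for value in logits]
    k = top_k if top_k > 0 else len(scaled)  # treat 0 as "keep everything"
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_token(logits, top_k=250, temperature=1.0, rng=random):
    """Draw one token id from the filtered distribution."""
    probs = top_k_probs(logits, top_k, temperature)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


cond = [2.0, 1.0, 0.0, -1.0]    # toy logits with the text prompt
uncond = [1.0, 1.0, 1.0, 1.0]   # toy logits without the prompt
guided = apply_cfg(cond, uncond, cfg_coef=3.0)
print(guided)
print(sample_token(guided, top_k=2, temperature=1.0))
```

In this toy case a higher `cfg_coef` widens the gap between prompt-favoured and prompt-disfavoured tokens, a lower `temperature` sharpens the distribution further, and `top_k` caps how many candidates can be drawn at all.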

MusicGen usage

Text-to-music generation

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0      # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)

Melody-conditioned generation

from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
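As the name `generate_with_chroma` suggests, melody conditioning works from a chroma representation of the reference audio, in which spectral energy is folded into the 12 pitch classes regardless of octave. The sketch below illustrates that folding on a handful of hand-picked spectral peaks. It is a toy illustration of what a chroma vector is, not AudioCraft's feature extractor, and the function names are hypothetical.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch class index (0 = C, 9 = A)."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_vector(peaks):
    """Fold (frequency_hz, magnitude) pairs into 12 normalised pitch-class bins."""
    bins = [0.0] * 12
    for freq_hz, magnitude in peaks:
        bins[pitch_class(freq_hz)] += magnitude
    total = sum(bins)
    return [value / total for value in bins] if total else bins


# Two A notes an octave apart land in the same bin; middle C lands in bin 0.
peaks = [(220.0, 1.0), (440.0, 1.0), (261.63, 2.0)]
chroma = chroma_vector(peaks)
print({PITCH_CLASSES[i]: v for i, v in enumerate(chroma) if v > 0})
```

Because octave and timbre are discarded, a representation like this carries the melodic contour of the reference while leaving instrumentation to the text prompt.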

Stereo generation

import torchaudio
from audiocraft.models import MusicGen

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
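The shape in the stereo example follows from simple arithmetic: samples = duration × sample rate, with two channels for stereo models and one otherwise. Below is a small sketch of that calculation, assuming the 32 kHz sample rate used in the examples here; the helper name is hypothetical.

```python
def expected_wav_shape(batch_size, duration_s, sample_rate=32000, stereo=False):
    """Expected (batch, channels, samples) shape of a generated waveform."""
    channels = 2 if stereo else 1
    return (batch_size, channels, int(round(duration_s * sample_rate)))


print(expected_wav_shape(1, 15, stereo=True))  # (1, 2, 480000)
print(expected_wav_shape(3, 30))               # (3, 1, 960000)
```

A check like this can be handy in tests to confirm that a stereo checkpoint was actually loaded before writing files.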

Audio continuation

import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue (assumed here to be mono and already at the model's 32 kHz sampling rate)
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)

MusicGen-Style usage

Style-conditioned generation

import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Double CFG coefficient (used for text-and-style conditioning)
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0   # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)

Style-only generation (no text)

# Generate matching style without text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only
)

wav = model.generate_with_style([None], style_audio, sr)
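The `cfg_coef_beta` parameter only matters when text and style are combined. A minimal sketch of the idea follows, assuming a double classifier-free guidance formulation in which the text-plus-style prediction is first contrasted with the style-only prediction (scaled by beta), and the result is then contrasted with the unconditional prediction (scaled by `cfg_coef`). The function name and toy logits are hypothetical, and the exact formula should be checked against the upstream MusicGen-Style code.

```python
def double_cfg(text_style_logits, style_logits, uncond_logits,
               cfg_coef=3.0, cfg_coef_beta=5.0):
    """Assumed double CFG: uncond + a * (style + b * (text_style - style) - uncond)."""
    return [
        u + cfg_coef * (s + cfg_coef_beta * (ts - s) - u)
        for ts, s, u in zip(text_style_logits, style_logits, uncond_logits)
    ]


def single_cfg(cond_logits, uncond_logits, cfg_coef=3.0):
    """Ordinary CFG, for comparison."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


text_style = [2.0, 0.5]
style_only = [1.0, 1.0]
uncond = [0.0, 0.0]

print(double_cfg(text_style, style_only, uncond, cfg_coef=3.0, cfg_coef_beta=5.0))
# With beta = 1 the style-only term cancels and ordinary CFG is recovered.
print(double_cfg(text_style, style_only, uncond, cfg_coef=3.0, cfg_coef_beta=1.0))
print(single_cfg(text_style, uncond, cfg_coef=3.0))
```

Under this formulation, beta values above 1 amplify whatever the text adds on top of the style reference, which is consistent with disabling it (`None`) when no text prompt is given.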

AudioGen usage

Sound effect generation

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)

EnCodec usage

Audio compression

from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# The 32 kHz codec is a mono model, so downmix multi-channel input
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
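To put the codec's compression into numbers, the sketch below compares the token bitrate with uncompressed PCM. It assumes the configuration described for MusicGen's 32 kHz tokenizer (50 frames per second, 4 codebooks of 2048 entries each); these values are hard-coded assumptions rather than read from the loaded model, and the function names are hypothetical.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second needed to store the discrete codes."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def pcm_bitrate_bps(sample_rate_hz, bits_per_sample=16, channels=1):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


codec = codec_bitrate_bps(50, 4, 2048)  # assumed 32 kHz EnCodec settings
pcm = pcm_bitrate_bps(32000)            # 16-bit mono PCM at 32 kHz
print(f"codec: {codec / 1000:.1f} kbps, pcm: {pcm / 1000:.0f} kbps, ratio: {pcm / codec:.0f}x")
```

Under these assumptions the token stream is about 2.2 kbps, which is why a language model can operate on it far more cheaply than on raw samples.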

Common workflows

Workflow 1: Music generation pipeline

import torch
import torchaudio
from audiocraft.models import MusicGen


class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)


# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")

Workflow 2: Sound design batch processing

import json
from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio


def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results


# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
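Because loading the model and generating audio are slow, it can help to validate the specification list before starting a batch. The following is a minimal sketch of such a check. The function name and the `max_duration` limit are hypothetical choices for illustration, not part of AudioCraft.

```python
def validate_sound_specs(sound_specs, max_duration=30.0):
    """Return a list of problems; an empty list means the specs look usable."""
    problems = []
    seen_names = set()
    for index, spec in enumerate(sound_specs):
        name = spec.get("name")
        if not isinstance(name, str) or not name.strip():
            problems.append(f"spec {index}: missing name")
        elif name in seen_names:
            # Duplicate names would silently overwrite each other's .wav file
            problems.append(f"spec {index}: duplicate name {name!r}")
        else:
            seen_names.add(name)

        description = spec.get("description")
        if not isinstance(description, str) or not description.strip():
            problems.append(f"spec {index}: missing description")

        duration = spec.get("duration", 5)  # same fallback as the batch function
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            problems.append(f"spec {index}: duration must be a number")
        elif not 0 < duration <= max_duration:
            problems.append(f"spec {index}: duration {duration} is out of range")
    return problems


specs = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "explosion", "description": "", "duration": 0},
]
for problem in validate_sound_specs(specs):
    print(problem)
```

Running the check first means a typo in the last specification is reported immediately rather than after every earlier sound has been generated.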

Workflow 3: Gradio demo

import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')


def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path


demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
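The demo writes every result to the same `temp_output.wav`, so two requests arriving close together could overwrite each other's file. One possible adjustment is to create a unique path per request with the standard library, as sketched below; the helper name and the `musicgen_` prefix are hypothetical.

```python
import os
import tempfile


def new_output_path(suffix=".wav", prefix="musicgen_"):
    """Create an empty, uniquely named file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix, prefix=prefix)
    os.close(fd)  # the caller reopens the path to write the audio
    return path


first = new_output_path()
second = new_output_path()
print(first != second)  # True

# Clean up the demonstration files
os.remove(first)
os.remove(second)
```

Inside `generate_music`, the fixed path could then be replaced with `path = new_output_path()` before saving.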

Performance optimization

Memory optimization

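import torch
from audiocraft.models import MusicGen
from transformers import MusicgenForConditionalGeneration

# Use a smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
# The audiocraft MusicGen wrapper is not an nn.Module, so model.half() is not
# available on it; with the Transformers API, float16 weights can be requested at load time.
hf_model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small", torch_dtype=torch.float16
).to("cuda")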

Batch processing efficiency

# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
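A single batch is the most efficient option only while it fits in memory. For long prompt lists, a middle ground is to split the list into fixed-size batches. Here is a minimal sketch of that; the helper name and the batch size of 4 are hypothetical choices.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = [f"prompt{i}" for i in range(1, 11)]
for batch in chunked(prompts, 4):
    # In a real run this is where model.generate(batch) would be called
    print(len(batch), batch[0], "...", batch[-1])
```

The batch size can then be tuned to the largest value that avoids out-of-memory errors for the chosen model and duration.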

GPU memory requirements

Model            FP32 VRAM  FP16 VRAM
musicgen-small   ~4GB       ~2GB
musicgen-medium  ~8GB       ~4GB
musicgen-large   ~16GB      ~8GB
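As a sanity check on figures like these, the memory taken by the language-model weights alone can be estimated as parameter count × bytes per parameter. The sketch below uses the parameter counts from the model variants table. It gives a lower bound only, since the text encoder, the audio codec, activations, and caches also need memory, which is presumably why the table values are higher.

```python
MODEL_PARAMS = {
    "musicgen-small": 300e6,
    "musicgen-medium": 1.5e9,
    "musicgen-large": 3.3e9,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in MODEL_PARAMS.items():
    fp32 = weight_memory_gb(params, 4)  # 4 bytes per float32 parameter
    fp16 = weight_memory_gb(params, 2)  # 2 bytes per float16 parameter
    print(f"{name}: weights ~{fp32:.1f} GB (FP32), ~{fp16:.1f} GB (FP16)")
```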

Common issues

Issue                     Solution
CUDA out of memory (OOM)  Use a smaller model or reduce the duration
Poor quality              Increase cfg_coef; write more specific prompts
Generation too short      Check the duration setting
Audio artifacts           Try a different temperature
Stereo not working        Use a stereo model variant
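The out-of-memory advice above can be automated by retrying with a shorter duration. The following is a minimal sketch of that pattern. `generate_fn` is a hypothetical callable standing in for a wrapper around the model, and the check relies on the assumption that the error is a `RuntimeError` whose message contains "out of memory", as CUDA OOM errors in PyTorch typically are.

```python
def generate_with_fallback(generate_fn, prompt, duration, min_duration=2.0):
    """Call generate_fn(prompt, duration), halving duration on out-of-memory errors.

    Returns (result, duration_used). Other errors, or an OOM once the next
    duration would drop below min_duration, are re-raised.
    """
    while True:
        try:
            return generate_fn(prompt, duration), duration
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or duration / 2 < min_duration:
                raise
            duration = duration / 2


# Stand-in for a real model call: pretends anything over 10 s runs out of memory
def fake_generate(prompt, duration):
    if duration > 10:
        raise RuntimeError("CUDA out of memory")
    return f"audio for {prompt!r}"


result, used = generate_with_fallback(fake_generate, "rain on a tin roof", 30)
print(result, used)
```

In a real pipeline it would also make sense to clear the CUDA cache between attempts and to log the duration that finally succeeded.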

References

  • Advanced Usage - Training, fine-tuning, deployment

  • Troubleshooting - Common issues and solutions
