# AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
## When to use AudioCraft

Use AudioCraft when you:

- Need to generate music from text descriptions
- Are creating sound effects and environmental audio
- Are building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer
Key features:

- **MusicGen**: Text-to-music generation with melody conditioning
- **AudioGen**: Text-to-sound-effects generation
- **EnCodec**: High-fidelity neural audio codec
- **Multiple model sizes**: Small (300M) to Large (3.3B)
- **Stereo support**: Full stereo audio generation
- **Style conditioning**: MusicGen-Style for reference-based generation
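All three model families are loaded through the same `get_pretrained` entry point. A minimal sketch (checkpoint names as published on the HuggingFace Hub; weights download on first use):

```python
from audiocraft.models import AudioGen, CompressionModel, MusicGen

music_model = MusicGen.get_pretrained('facebook/musicgen-small')    # text-to-music
sound_model = AudioGen.get_pretrained('facebook/audiogen-medium')   # text-to-sound
codec = CompressionModel.get_pretrained('facebook/encodec_32khz')   # neural audio codec
```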
Use alternatives instead:

- **Stable Audio**: For longer, commercial music generation
- **Bark**: For text-to-speech with music/sound effects
- **Riffusion**: For spectrogram-based music generation
- **OpenAI Jukebox**: For raw audio generation with lyrics
## Quick start

### Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```
### Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,       # seconds
    top_k=250,
    temperature=1.0,
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```
### Using HuggingFace Transformers

```python
import scipy
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256,
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```
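With the Transformers API, output length is controlled by `max_new_tokens` rather than a duration parameter. MusicGen produces roughly 50 token frames per second of audio, so `max_new_tokens=256` above corresponds to about 5 seconds:

```python
# Rough conversion between seconds and tokens (~50 frames/s frame rate):
desired_seconds = 10
max_new_tokens = int(50 * desired_seconds)  # ~500 tokens for 10 s of audio
```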
### Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```
## Core concepts

### Architecture overview

```
AudioCraft Architecture:

┌──────────────────────────────────────────────────────────────┐
│                       Text Encoder (T5)                      │
│                              │                               │
│                       Text Embeddings                        │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                  Transformer Decoder (LM)                    │
│           Auto-regressively generates audio tokens           │
│         using efficient token interleaving patterns          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                    EnCodec Audio Decoder                     │
│           Converts tokens back to audio waveform             │
└──────────────────────────────────────────────────────────────┘
```
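The stages in the diagram map onto attributes of a loaded model. A minimal inspection sketch (attribute names as in the `audiocraft` package; the T5 text encoder is attached to the LM as a conditioner):

```python
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

print(type(model.lm).__name__)                 # autoregressive token LM
print(type(model.compression_model).__name__)  # EnCodec compression model
print(model.sample_rate)                       # 32000: output sample rate in Hz
print(model.frame_rate)                        # 50: token frames per second
```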
### Model variants

| Model | Size | Description | Use Case |
|-------|------|-------------|----------|
| `musicgen-small` | 300M | Text-to-music | Quick generation |
| `musicgen-medium` | 1.5B | Text-to-music | Balanced |
| `musicgen-large` | 3.3B | Text-to-music | Best quality |
| `musicgen-melody` | 1.5B | Text + melody | Melody conditioning |
| `musicgen-melody-large` | 3.3B | Text + melody | Best melody quality |
| `musicgen-stereo-*` | Varies | Stereo output | Stereo generation |
| `musicgen-style` | 1.5B | Style transfer | Reference-based generation |
| `audiogen-medium` | 1.5B | Text-to-sound | Sound effects |
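A small helper can map a use case onto these checkpoints. This is an illustrative wrapper, not part of the AudioCraft API:

```python
from audiocraft.models import AudioGen, MusicGen

def load_model(task: str, quality: str = "medium"):
    """Pick a checkpoint from the table above (hypothetical helper)."""
    if task == "sound":
        return AudioGen.get_pretrained("facebook/audiogen-medium")
    size = {"fast": "small", "medium": "medium", "best": "large"}[quality]
    return MusicGen.get_pretrained(f"facebook/musicgen-{size}")

model = load_model("music", quality="fast")  # loads facebook/musicgen-small
```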
### Generation parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `duration` | 8.0 | Length in seconds (1-120) |
| `top_k` | 250 | Top-k sampling |
| `top_p` | 0.0 | Nucleus sampling (0 = disabled) |
| `temperature` | 1.0 | Sampling temperature |
| `cfg_coef` | 3.0 | Classifier-free guidance strength |
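For example, to switch from top-k to nucleus sampling (in the audiocraft implementation a non-zero `top_p` takes precedence over `top_k`; verify against your installed version):

```python
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(
    duration=8,
    top_p=0.9,        # sample from the smallest set covering 90% of probability mass
    temperature=1.0,
    cfg_coef=3.0,
)
```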
## MusicGen usage

### Text-to-music generation

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0,     # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar",
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```
### Melody-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```
### Stereo generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```
### Audio continuation

```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt",
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```
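Continuing from the block above: the result is a waveform tensor, so it can be saved the same way as in the earlier Transformers example:

```python
import scipy

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(
    "continuation.wav",
    rate=sampling_rate,
    data=audio_values[0, 0].cpu().numpy(),
)
```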
## MusicGen-Style usage

### Style-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0,  # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0,  # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```
### Style-only generation (no text)

```python
# Generate matching the style reference without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None,  # Disable double CFG for style-only generation
)

wav = model.generate_with_style([None], style_audio, sr)
```
## AudioGen usage

### Sound effect generation

```python
import torchaudio
from audiocraft.models import AudioGen

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest",
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```
## EnCodec usage

### Audio compression

```python
import torch
import torchaudio
from audiocraft.models import CompressionModel

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
## Common workflows

### Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg,
        )
        with torch.no_grad():
            wav = self.model.generate([prompt])
        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)
        with torch.no_grad():
            wav = self.model.generate(prompts)
        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0,
)
generator.save(audio, "epic_music.wav")
```
### Workflow 2: Sound design batch processing

```python
from pathlib import Path

import torchaudio
from audiocraft.models import AudioGen

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []
    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))
        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"],
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2},
]

results = batch_generate_sounds(sounds, "sound_effects/")
```
### Workflow 3: Gradio demo

```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef,
    )
    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient"),
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo",
)

demo.launch()
```
## Performance optimization

### Memory optimization

```python
import torch
from audiocraft.models import MusicGen

# Use a smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```
### Batch processing efficiency

```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batched call

# Instead of:
for desc in descriptions:
    wav = model.generate([desc])  # Multiple separate calls (slower)
```
### GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|-------|-----------|-----------|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
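A rough way to pick a checkpoint from the available VRAM, using the FP32 figures above (a heuristic sketch, not an official sizing rule):

```python
import torch
from audiocraft.models import MusicGen

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 16:
        name = 'facebook/musicgen-large'
    elif vram_gb >= 8:
        name = 'facebook/musicgen-medium'
    else:
        name = 'facebook/musicgen-small'
else:
    name = 'facebook/musicgen-small'  # CPU fallback: smallest model

model = MusicGen.get_pretrained(name)
```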
## Common issues

| Issue | Solution |
|-------|----------|
| CUDA OOM | Use a smaller model, reduce duration |
| Poor quality | Increase `cfg_coef`, write more specific prompts |
| Generation too short | Check the `duration` setting |
| Audio artifacts | Try a different temperature |
| Stereo not working | Use a stereo model variant (`musicgen-stereo-*`) |
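For the CUDA OOM case, a defensive pattern is to retry at progressively shorter durations after clearing the cache (an illustrative wrapper, not part of the AudioCraft API; `torch.cuda.OutOfMemoryError` requires PyTorch 1.13+):

```python
import torch

def generate_with_fallback(model, prompt, duration=30):
    """Retry generation at shorter durations when the GPU runs out of memory."""
    for attempt_duration in (duration, duration // 2, 10):
        try:
            model.set_generation_params(duration=attempt_duration)
            return model.generate([prompt])
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # free cached blocks before retrying
    raise RuntimeError("Generation failed even at the shortest duration")
```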
## References

- Advanced Usage - Training, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
## Resources

- Paper (MusicGen): https://arxiv.org/abs/2306.05284
- Paper (AudioGen): https://arxiv.org/abs/2209.15352
- HuggingFace: https://huggingface.co/facebook/musicgen-small