mamba-architecture

Mamba - Selective State Space Models


Install the "mamba-architecture" skill with this command:

npx skills add davila7/claude-code-templates/davila7-claude-code-templates-mamba-architecture


Quick start

Mamba is a selective state-space model (SSM) architecture whose compute and memory scale linearly, O(n), with sequence length, instead of the quadratic cost of Transformer attention.

Installation:

Install causal-conv1d (optional, for efficiency)

pip install "causal-conv1d>=1.4.0"

Install Mamba

pip install mamba-ssm

Or both together

pip install "mamba-ssm[causal-conv1d]"

Prerequisites: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+
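
A quick post-install sanity check (a small sketch; it only verifies that PyTorch sees the GPU and that the packages import — the __version__ lookup is a best-effort assumption):

import torch

assert torch.cuda.is_available(), "Mamba's CUDA kernels require an NVIDIA GPU"
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)

import mamba_ssm
print("mamba-ssm:", getattr(mamba_ssm, "__version__", "unknown"))

try:
    import causal_conv1d  # optional fused conv kernel
    print("causal-conv1d: available")
except ImportError:
    print("causal-conv1d: not installed (slower fallback path will be used)")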

Basic usage (Mamba block):

import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,   # Model dimension
    d_state=16,    # SSM state dimension
    d_conv=4,      # Conv1d kernel size
    expand=2,      # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
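
A single Mamba block maps a (batch, length, d_model) tensor to a tensor of the same shape, so blocks can be stacked like Transformer layers. A minimal sketch of such a stack, assuming a simple pre-norm residual wrapper (illustrative only, not the library's reference LM architecture):

import torch
import torch.nn as nn
from mamba_ssm import Mamba

class MambaStack(nn.Module):
    """Toy stack of Mamba blocks with pre-norm residual connections."""

    def __init__(self, d_model=256, n_layer=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2) for _ in range(n_layer)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layer)])

    def forward(self, x):
        # x: (batch, length, d_model); every block preserves the shape
        for norm, block in zip(self.norms, self.layers):
            x = x + block(norm(x))
        return x

model = MambaStack(d_model=256, n_layer=4).to("cuda")
y = model(torch.randn(2, 64, 256, device="cuda"))
assert y.shape == (2, 64, 256)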

Common workflows

Workflow 1: Language model with Mamba-2

Complete LM with generation:

from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.models.config_mamba import MambaConfig
import torch

# Configure a Mamba-2 LM
config = MambaConfig(
    d_model=1024,          # Hidden dimension
    n_layer=24,            # Number of layers
    vocab_size=50277,      # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",    # Use Mamba-2 blocks
        d_state=128,       # Larger state for Mamba-2
        headdim=64,        # Head dimension
        ngroups=1,         # Number of groups
    ),
)

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
)
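
The same model can be trained with an ordinary next-token cross-entropy objective. A minimal sketch of one training step, assuming the forward pass returns an object with a .logits attribute of shape (batch, length, vocab_size); in practice you would use mixed precision (torch.cuda.amp) and a real dataloader rather than pure FP16 and random tokens:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Dummy batch of token ids; replace with batches from your dataset
tokens = torch.randint(0, config.vocab_size, (2, 128), device="cuda")

logits = model(tokens).logits                 # (batch, length, vocab_size)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)).float(),  # predict token t+1 from prefix up to t
    tokens[:, 1:].reshape(-1),
)
loss.backward()
optimizer.step()
optimizer.zero_grad()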

Workflow 2: Use pretrained Mamba models

Load from HuggingFace:

from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
import torch

# Load pretrained model (Mamba checkpoints use the GPT-NeoX tokenizer)
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)

Available models:

  • state-spaces/mamba-130m

  • state-spaces/mamba-370m

  • state-spaces/mamba-790m

  • state-spaces/mamba-1.4b

  • state-spaces/mamba-2.8b

Workflow 3: Mamba-1 vs Mamba-2

Mamba-1 (smaller state):

from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,   # Smaller state dimension
    d_conv=4,
    expand=2,
).to("cuda")

Mamba-2 (multi-head, larger state):

from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,  # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,   # Head dimension for the multi-head structure
    ngroups=1,    # Parallel groups
).to("cuda")

Key differences:

  • State size: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)

  • Architecture: Mamba-2 has multi-head structure

  • Normalization: Mamba-2 uses RMSNorm

  • Distributed: Mamba-2 supports tensor parallelism
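
A quick way to see the difference is to instantiate both blocks and compare their sizes. A rough inspection sketch (the constants mirror the configs above; the state-size arithmetic ignores the small convolution buffer):

import torch
from mamba_ssm import Mamba, Mamba2

def n_params(module):
    return sum(p.numel() for p in module.parameters())

m1 = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
m2 = Mamba2(d_model=256, d_state=128, d_conv=4, expand=2, headdim=64, ngroups=1).to("cuda")

print(f"Mamba-1 block parameters: {n_params(m1):,}")
print(f"Mamba-2 block parameters: {n_params(m2):,}")

# Recurrent SSM state per layer is roughly d_inner * d_state elements,
# with d_inner = expand * d_model = 512 here.
print("Mamba-1 state elements per layer:", 512 * 16)    # 8,192
print("Mamba-2 state elements per layer:", 512 * 128)   # 65,536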

Workflow 4: Benchmark vs Transformers

Generation speed comparison:

# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

Expected results:

  • Speed: up to ~5× higher generation throughput than a size-matched Transformer

  • Memory: No KV cache needed

  • Scaling: Linear with sequence length
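
For a quick manual measurement without the benchmark script, you can also time generation directly. A rough sketch, reusing the model and input_ids from Workflow 2 (synchronize CUDA before and after reading the clock so the timing is meaningful):

import time
import torch

torch.cuda.synchronize()
start = time.time()
output_ids = model.generate(input_ids=input_ids, max_length=500, temperature=0.7, top_p=0.9)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")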

When to use vs alternatives

Use Mamba when:

  • Need long sequences (100K+ tokens)

  • Want faster inference than Transformers

  • Memory-constrained (no KV cache)

  • Building streaming applications

  • Linear scaling important

Advantages:

  • O(n) complexity: Linear vs quadratic

  • 5× faster inference: No attention overhead

  • No KV cache: Lower memory usage

  • Million-token sequences: practical thanks to hardware-aware scan kernels

  • Streaming: Constant memory per token

Use alternatives instead:

  • Transformers: Need best-in-class performance, have compute

  • RWKV: Want RNN+Transformer hybrid

  • RetNet: Need retention-based architecture

  • Hyena: Want convolution-based approach

Common issues

Issue: CUDA out of memory

Reduce batch size or use gradient checkpointing:

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing

Issue: Slow installation

If pip tries to build from source and stalls, pass --no-build-isolation so the build uses your already-installed PyTorch instead of an isolated build environment:

pip install mamba-ssm --no-build-isolation

Issue: Missing causal-conv1d

Install separately:

pip install "causal-conv1d>=1.4.0"

Issue: Model not loading from HuggingFace

Load checkpoints with MambaLMHeadModel.from_pretrained, not transformers' AutoModel:

from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")

Advanced topics

Selective SSM: See references/selective-ssm.md for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
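
As a rough intuition for why selectivity still allows O(n) compute: a discretized selective SSM updates a fixed-size hidden state once per token, with B, C, and the step size delta computed from the input itself. A heavily simplified, unoptimized PyTorch sketch of that recurrence (the real library fuses this into a hardware-aware parallel scan kernel; the shapes and discretization here are illustrative):

import torch
import torch.nn.functional as F

def selective_ssm_scan(x, A, B, C, delta):
    # x:     (batch, length, d)   input features
    # A:     (d, n)               shared state-decay parameters (negative for stability)
    # B, C:  (batch, length, n)   input-dependent ("selective") projections
    # delta: (batch, length, d)   input-dependent step sizes
    batch, length, d = x.shape
    h = torch.zeros(batch, d, A.shape[-1], device=x.device)  # fixed-size state: O(1) in length
    ys = []
    for t in range(length):                               # one state update per token: O(length)
        dt = delta[:, t].unsqueeze(-1)                     # (batch, d, 1)
        A_bar = torch.exp(dt * A)                          # discretized transition
        B_bar = dt * B[:, t].unsqueeze(1)                  # (batch, d, n)
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)      # state update
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))      # readout: (batch, d)
    return torch.stack(ys, dim=1)

batch, length, d, n = 2, 32, 8, 4
x = torch.randn(batch, length, d)
A = -torch.rand(d, n)
B = torch.randn(batch, length, n)
C = torch.randn(batch, length, n)
delta = F.softplus(torch.randn(batch, length, d))
y = selective_ssm_scan(x, A, B, C, delta)
assert y.shape == x.shape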

Mamba-2 architecture: See references/mamba2-details.md for multi-head structure, tensor parallelism, and distributed training setup.

Performance optimization: See references/performance.md for hardware-aware design, CUDA kernels, and memory efficiency techniques.

Hardware requirements

  • GPU: NVIDIA with CUDA 11.6+

  • VRAM (approximate):

    • 130M model: 2 GB

    • 370M model: 4 GB

    • 790M model: 8 GB

    • 1.4B model: 14 GB

    • 2.8B model: 28 GB (FP16)

  • Inference: 5× faster than Transformers

  • Memory: No KV cache (lower than Transformers)

Performance (vs Transformers):

  • Speed: 5× faster inference

  • Memory: 50% less (no KV cache)

  • Scaling: Linear vs quadratic

