nnsight-remote-interpretability


Install skill "nnsight-remote-interpretability" with this command: npx skills add zechenzhangagi/ai-research-skills/zechenzhangagi-ai-research-skills-nnsight-remote-interpretability

nnsight: Transparent Access to Neural Network Internals

nnsight (/ɛn.saɪt/) enables researchers to interpret and manipulate the internals of any PyTorch model, with the unique capability of running the same code locally on small models or remotely on massive models (70B+) via NDIF.

GitHub: ndif-team/nnsight (730+ stars)
Paper: NNsight and NDIF: Democratizing Access to Foundation Model Internals (ICLR 2025)

Key Value Proposition

Write once, run anywhere: The same interpretability code works on GPT-2 locally or on Llama-3.1-405B remotely. Just toggle remote=True.

Local execution (small model)

with model.trace("Hello world"): hidden = model.transformer.h[5].output[0].save()

Remote execution (massive model) - same code!

with model.trace("Hello world", remote=True): hidden = model.model.layers[40].output[0].save()

When to Use nnsight

Use nnsight when you need to:

  • Run interpretability experiments on models too large for local GPUs (70B, 405B)

  • Work with any PyTorch architecture (transformers, Mamba, custom models)

  • Perform multi-token generation interventions

  • Share activations between different prompts

  • Access full model internals without reimplementation

Consider alternatives when:

  • You want consistent API across models → Use TransformerLens

  • You need declarative, shareable interventions → Use pyvene

  • You're training SAEs → Use SAELens

  • You only work with small models locally → TransformerLens may be simpler

Installation

Basic installation

pip install nnsight

For vLLM support

pip install "nnsight[vllm]"

For remote NDIF execution, sign up at login.ndif.us for an API key.

Core Concepts

LanguageModel Wrapper

from nnsight import LanguageModel

Load model (uses HuggingFace under the hood)

model = LanguageModel("openai-community/gpt2", device_map="auto")

For larger models

model = LanguageModel("meta-llama/Llama-3.1-8B", device_map="auto")

Tracing Context

The trace context manager enables deferred execution - operations are collected into a computation graph:

from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in") as tracer: # Access any module's output hidden_states = model.transformer.h[5].output[0].save()

# Access attention patterns
attn = model.transformer.h[5].attn.attn_dropout.input[0][0].save()

# Modify activations
model.transformer.h[8].output[0][:] = 0  # Zero out layer 8

# Get final output
logits = model.output.save()

After context exits, access saved values

print(hidden_states.shape) # [batch, seq, hidden]

Proxy Objects

Inside trace, module accesses return Proxy objects that record operations:

with model.trace("Hello"): # These are all Proxy objects - operations are deferred h5_out = model.transformer.h[5].output[0] # Proxy h5_mean = h5_out.mean(dim=-1) # Proxy h5_saved = h5_mean.save() # Save for later access

Workflow 1: Activation Analysis

Step-by-Step

import torch
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

prompt = "The capital of France is"

with model.trace(prompt) as tracer:
    # 1. Collect activations from multiple layers
    layer_outputs = []
    for i in range(12):  # GPT-2 has 12 layers
        layer_out = model.transformer.h[i].output[0].save()
        layer_outputs.append(layer_out)

    # 2. Get attention patterns
    attn_patterns = []
    for i in range(12):
        # Access attention weights (after softmax)
        attn = model.transformer.h[i].attn.attn_dropout.input[0][0].save()
        attn_patterns.append(attn)

    # 3. Get final logits (the .logits field of the HF output object)
    logits = model.output.logits.save()

# 4. Analyze outside the context
for i, layer_out in enumerate(layer_outputs):
    print(f"Layer {i} output shape: {layer_out.shape}")
    print(f"Layer {i} norm: {layer_out.norm().item():.3f}")

# 5. Find top predictions
probs = torch.softmax(logits[0, -1], dim=-1)
top_tokens = probs.topk(5)
for token, prob in zip(top_tokens.indices, top_tokens.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")

Checklist

  • Load model with LanguageModel wrapper

  • Use trace context for operations

  • Call .save() on values you need after context

  • Access saved values outside context

  • Use .shape, .norm(), etc. for analysis

Workflow 2: Activation Patching

Step-by-Step

import torch
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

clean_prompt = "The Eiffel Tower is in"
corrupted_prompt = "The Colosseum is in"

# 1. Get clean activations
with model.trace(clean_prompt) as tracer:
    clean_hidden = model.transformer.h[8].output[0].save()

# 2. Patch clean activations into the corrupted run
with model.trace(corrupted_prompt) as tracer:
    # Replace layer 8 output with clean activations
    model.transformer.h[8].output[0][:] = clean_hidden

    patched_logits = model.output.logits.save()

# 3. Compare predictions
paris_token = model.tokenizer.encode(" Paris")[0]
rome_token = model.tokenizer.encode(" Rome")[0]

patched_probs = torch.softmax(patched_logits[0, -1], dim=-1)
print(f"Paris prob: {patched_probs[paris_token].item():.3f}")
print(f"Rome prob: {patched_probs[rome_token].item():.3f}")

Systematic Patching Sweep

def patch_layer_position(layer, position, clean_cache, corrupted_prompt):
    """Patch a single layer/position from the clean run into the corrupted run."""
    with model.trace(corrupted_prompt) as tracer:
        # Get current activation
        current = model.transformer.h[layer].output[0]

        # Patch only the specific position
        current[:, position, :] = clean_cache[layer][:, position, :]

        logits = model.output.logits.save()

    return logits
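
The sweep below also assumes a per-layer clean cache, the prompt's sequence length, and a compute_metric helper, none of which are defined above. A minimal sketch of these pieces, reusing the Paris/Rome logit difference from Workflow 2 (illustrative, not from the upstream skill):

# Cache every layer's clean activation for patch_layer_position
with model.trace(clean_prompt) as tracer:
    clean_cache = [model.transformer.h[i].output[0].save() for i in range(12)]

seq_len = len(model.tokenizer.encode(clean_prompt))

# Metric: how strongly the patched run prefers the clean answer over the corrupted one
paris_token = model.tokenizer.encode(" Paris")[0]
rome_token = model.tokenizer.encode(" Rome")[0]

def compute_metric(logits):
    last = logits[0, -1]
    return (last[paris_token] - last[rome_token]).item()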

# Sweep over all layers and positions
results = torch.zeros(12, seq_len)
for layer in range(12):
    for pos in range(seq_len):
        logits = patch_layer_position(layer, pos, clean_cache, corrupted_prompt)
        results[layer, pos] = compute_metric(logits)

Workflow 3: Remote Execution with NDIF

Run the same experiments on massive models without local GPUs.

Step-by-Step

from nnsight import LanguageModel

# 1. Load large model (will run remotely)
model = LanguageModel("meta-llama/Llama-3.1-70B")

# 2. Same code, just add remote=True
with model.trace("The meaning of life is", remote=True) as tracer:
    # Access internals of a 70B model!
    layer_40_out = model.model.layers[40].output[0].save()
    logits = model.output.logits.save()

# 3. Results returned from NDIF
print(f"Layer 40 shape: {layer_40_out.shape}")

# 4. Generation with interventions
with model.trace(remote=True) as tracer:
    with tracer.invoke("What is 2+2?"):
        # Intervene during generation
        model.model.layers[20].output[0][:, -1, :] *= 1.5

output = model.generate(max_new_tokens=50)

NDIF Setup

  • Sign up at login.ndif.us

  • Get API key

  • Set environment variable or pass to nnsight:

import os
os.environ["NDIF_API_KEY"] = "your_key"

Or configure directly

from nnsight import CONFIG
CONFIG.API_KEY = "your_key"

Available Models on NDIF

  • Llama-3.1-8B, 70B, 405B

  • DeepSeek-R1 models

  • Various open-weight models (check ndif.us for current list)

Workflow 4: Cross-Prompt Activation Sharing

Share activations between different inputs in a single trace.

from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace() as tracer:
    # First prompt
    with tracer.invoke("The cat sat on the"):
        cat_hidden = model.transformer.h[6].output[0].save()

    # Second prompt - inject cat's activations
    with tracer.invoke("The dog ran through the"):
        # Replace with cat's activations at layer 6
        model.transformer.h[6].output[0][:] = cat_hidden
        dog_with_cat = model.output.logits.save()

# The dog prompt now carries the cat prompt's internal representations at layer 6
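
To check what the injection does to the model's predictions, one can also run the second prompt unmodified and compare next-token distributions. This is an illustrative extension; the dog_baseline variable and the top-3 comparison are not part of the upstream skill:

import torch

# Baseline run of the second prompt, without the injected activations
with model.trace("The dog ran through the") as tracer:
    dog_baseline = model.output.logits.save()

for name, logits in [("baseline", dog_baseline), ("cat layer 6 injected", dog_with_cat)]:
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = probs.topk(3)
    print(name, [(model.tokenizer.decode(t), round(p.item(), 3))
                 for t, p in zip(top.indices, top.values)])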

Workflow 5: Gradient-Based Analysis

Access gradients during backward pass.

import torch
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The quick brown fox") as tracer: # Save activations and enable gradient hidden = model.transformer.h[5].output[0].save() hidden.retain_grad()

logits = model.output

# Compute loss on specific token
target_token = model.tokenizer.encode(" jumps")[0]
loss = -logits[0, -1, target_token]

# Backward pass
loss.backward()

Access gradients

grad = hidden.grad print(f"Gradient shape: {grad.shape}") print(f"Gradient norm: {grad.norm().item():.3f}")
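
The saved activations and their gradients can also be combined into a rough per-token attribution score (input times gradient). This is an illustrative extension, not part of the upstream skill:

# Input-times-gradient attribution per token position (illustrative)
saliency = (hidden * grad).sum(dim=-1)  # [batch, seq]

token_ids = model.tokenizer.encode("The quick brown fox")
for tok, score in zip(token_ids, saliency[0]):
    print(f"{model.tokenizer.decode(tok):>10s}  {score.item():+.4f}")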

Note: Gradient access is not supported for vLLM or remote execution.

Common Issues & Solutions

Issue: Module path differs between models

GPT-2 structure

model.transformer.h[5].output[0]

LLaMA structure

model.model.layers[5].output[0]

Solution: Check model structure

print(model._model) # See actual module names
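
A small helper can hide this naming difference. The sketch below is illustrative and only covers the two layouts shown above (GPT-2's transformer.h and the LLaMA-style model.layers); anything else falls through to an error:

def get_decoder_layers(model):
    """Return the stack of transformer blocks for either naming convention."""
    if hasattr(model._model, "transformer"):  # GPT-2 style
        return model.transformer.h
    if hasattr(model._model, "model"):        # LLaMA / Mistral style
        return model.model.layers
    raise ValueError("Unknown architecture; inspect print(model._model)")

layers = get_decoder_layers(model)
n_layers = model._model.config.num_hidden_layers
with model.trace("Hello"):
    mid = layers[n_layers // 2].output[0].save()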

Issue: Forgetting to save

WRONG: Value not accessible outside trace

with model.trace("Hello"): hidden = model.transformer.h[5].output[0] # Not saved!

print(hidden) # Error or wrong value

RIGHT: Call .save()

with model.trace("Hello"): hidden = model.transformer.h[5].output[0].save()

print(hidden) # Works!

Issue: Remote timeout

For long operations, increase timeout

with model.trace("prompt", remote=True, timeout=300) as tracer: # Long operation...

Issue: Memory with many saved activations

Only save what you need

with model.trace("prompt"): # Don't save everything for i in range(100): model.transformer.h[i].output[0].save() # Memory heavy!

# Better: save specific layers
key_layers = [0, 5, 11]
for i in key_layers:
    model.transformer.h[i].output[0].save()

Issue: vLLM gradient limitation

vLLM doesn't support gradients

Use standard execution for gradient analysis

model = LanguageModel("gpt2", device_map="auto") # Not vLLM

Key API Reference

Method/Property                      Purpose
model.trace(prompt, remote=False)    Start a tracing context
proxy.save()                         Save a value for access after the trace
proxy[:]                             Slice/index a proxy (assignment patches activations)
tracer.invoke(prompt)                Add a prompt within a trace
model.generate(...)                  Generate with interventions
model.output                         Final model output (access .logits for token scores)
model._model                         Underlying HuggingFace model

Comparison with Other Tools

Feature              nnsight       TransformerLens      pyvene
Any architecture     Yes           Transformers only    Yes
Remote execution     Yes (NDIF)    No                   No
Consistent API       No            Yes                  Yes
Deferred execution   Yes           No                   No
HuggingFace native   Yes           Reimplemented        Yes
Shareable configs    No            No                   Yes

Reference Documentation

For detailed API documentation, tutorials, and advanced usage, see the references/ folder:

File                      Contents
references/README.md      Overview and quick start guide
references/api.md         Complete API reference for LanguageModel, tracing, and proxy objects
references/tutorials.md   Step-by-step tutorials for local and remote interpretability

External Resources

Tutorials

  • Getting Started

  • Features Overview

  • Remote Execution

  • Applied Tutorials

Official Documentation

  • Official Docs

  • NDIF Info

  • Community Forum

Papers

  • NNsight and NDIF Paper - Fiotto-Kaufman et al. (ICLR 2025)

Architecture Support

nnsight works with any PyTorch model:

  • Transformers: GPT-2, LLaMA, Mistral, etc.

  • State Space Models: Mamba

  • Vision Models: ViT, CLIP

  • Custom architectures: Any nn.Module

The key is knowing the module structure to access the right components.
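
For models that are not hosted on the HuggingFace Hub, nnsight also provides a generic wrapper around arbitrary nn.Module objects. A minimal sketch, assuming the NNsight wrapper class and using a toy network defined here purely for illustration:

from collections import OrderedDict

import torch
from nnsight import NNsight

# Toy network, purely for illustration; any nn.Module can be wrapped the same way
net = torch.nn.Sequential(OrderedDict([
    ("layer1", torch.nn.Linear(16, 32)),
    ("act", torch.nn.ReLU()),
    ("layer2", torch.nn.Linear(32, 4)),
]))
tiny = NNsight(net)

with tiny.trace(torch.randn(2, 16)):
    hidden = tiny.layer1.output.save()  # access internals by module name
    out = tiny.output.save()

print(hidden.shape, out.shape)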
