model-extraction

Techniques to extract model weights, architecture, and training data through API queries

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "model-extraction" with this command: npx skills add pluginagentmarketplace/custom-plugin-ai-red-teaming/pluginagentmarketplace-custom-plugin-ai-red-teaming-model-extraction

Model Extraction Attacks

Test AI systems for model-theft vulnerabilities, where an attacker reconstructs a model's behavior from its responses to API queries.

Quick Reference

Skill:       model-extraction
Agent:       04-llm-vulnerability-analyst
OWASP:       LLM03 (Supply Chain), LLM02 (Sensitive Info Disclosure)
MITRE:       AML.T0024 (Exfiltration via ML Inference API), AML.T0024.002 (Extract ML Model)
Risk Level:  HIGH

Extraction Techniques

1. Query-Based Extraction

Technique: query_based
Queries Required: 10,000-100,000
Fidelity: 70-90%
Detection: Medium

Protocol:
  1. Generate diverse query set
  2. Collect model responses
  3. Train surrogate model
  4. Validate fidelity

class QueryBasedExtractor:
    """Skeleton only: the sampling, training, and fidelity helpers are left to subclasses."""

    def extract(self, target_api, num_queries=10000):
        # Label a diverse query set with the target's own responses.
        training_data = []
        for query in self.generate_diverse_queries(num_queries):
            response = target_api(query)
            training_data.append((query, response))

        # Fit a surrogate on the (query, response) pairs, then score agreement.
        surrogate = self.train_surrogate(training_data)
        fidelity = self.measure_fidelity(target_api, surrogate)
        return surrogate, fidelity

    def generate_diverse_queries(self, n):
        """Generate n queries covering the input space."""
        third = n // 3
        queries = []
        # Random sampling
        queries.extend(self.random_samples(third))
        # Boundary probing
        queries.extend(self.boundary_samples(third))
        # Semantic variations (takes the remainder so exactly n queries are returned)
        queries.extend(self.semantic_variations(n - 2 * third))
        return queries
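
The listing reports fidelity as a percentage but does not define `measure_fidelity`. A minimal sketch, assuming fidelity means top-1 label agreement between target and surrogate on held-out queries (one common definition, not necessarily the one the skill uses). The names `toy_target` and `toy_surrogate` are hypothetical stand-ins for illustration.

```python
from typing import Callable, Hashable, Iterable


def measure_fidelity(
    target: Callable[[str], Hashable],
    surrogate: Callable[[str], Hashable],
    holdout_queries: Iterable[str],
) -> float:
    """Return the fraction of held-out queries on which both models agree."""
    queries = list(holdout_queries)
    if not queries:
        raise ValueError("holdout_queries must not be empty")
    matches = sum(1 for q in queries if target(q) == surrogate(q))
    return matches / len(queries)


# Hypothetical toy classifiers that disagree only on 6-character inputs.
def toy_target(text: str) -> str:
    return "long" if len(text) > 5 else "short"


def toy_surrogate(text: str) -> str:
    return "long" if len(text) > 6 else "short"


holdout = ["hi", "hello", "extract", "models", "fidelity"]
print(f"fidelity = {measure_fidelity(toy_target, toy_surrogate, holdout):.2f}")  # 0.80
```

Scoring on queries that were not used to train the surrogate keeps the number from being inflated by memorized pairs.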

2. Distillation Attack

Technique: distillation
Queries Required: 50,000+
Fidelity: 85-95%
Detection: High (volume-based)

Protocol:
  1. Query target extensively
  2. Use soft labels (probabilities)
  3. Train student model with KD loss
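  4. Validate behavioral fidelity of the student

class DistillationAttack:
    """Skeleton only: query_generator, soften, kd_loss, and update are left to subclasses."""

    def __init__(self, temperature=3.0):
        self.temperature = temperature

    def extract(self, target_api, student_model):
        for query in self.query_generator():
            # Get soft labels (class probabilities) from the target
            soft_labels = target_api(query, return_probs=True)
            soft_labels = self.soften(soft_labels, self.temperature)

            # Train the student against the tempered targets; the student's own
            # output is tempered with the same temperature before comparison.
            student_pred = self.soften(student_model(query), self.temperature)
            loss = self.kd_loss(student_pred, soft_labels)
            self.update(student_model, loss)

        return student_model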
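
The `soften` and `kd_loss` helpers are not shown in the listing. A minimal sketch of what they might compute, assuming the API returns a plain probability vector and that the loss is the standard temperature-scaled KL divergence from knowledge distillation. Both function bodies are illustrative assumptions, written in pure Python for a single example.

```python
import math
from typing import List, Sequence


def soften(probs: Sequence[float], temperature: float) -> List[float]:
    """Re-temper a probability vector as p_i^(1/T), renormalized.

    This equals softmax(logits / T) whenever probs = softmax(logits).
    """
    scaled = [max(p, 1e-12) ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def kd_loss(
    student_probs: Sequence[float],
    teacher_probs: Sequence[float],
    temperature: float = 3.0,
) -> float:
    """KL(teacher_T || student_T), scaled by T^2 to keep gradient magnitudes comparable."""
    p = soften(teacher_probs, temperature)
    q = soften(student_probs, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2


teacher = [0.7, 0.2, 0.1]
print(soften(teacher, 3.0))                # flatter than the raw distribution
print(kd_loss([0.1, 0.2, 0.7], teacher))   # positive: the student disagrees
print(kd_loss(teacher, teacher))           # zero: the distributions are identical
```

This also shows why soft labels matter to a defender: an API that returns full probability vectors exposes far more signal per query than one that returns only the top label.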

3. Embedding Extraction

Technique: embedding
Target: Embedding APIs
Risk: Intellectual property theft

Protocol:
  1. Query embedding endpoint
  2. Collect high-dimensional vectors
  3. Analyze embedding space
  4. Reconstruct embedding model
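
class EmbeddingExtractor:
    def extract_space(self, embedding_api, corpus):
        embeddings = []
        for text in corpus:
            emb = embedding_api.get_embedding(text)
            embeddings.append((text, emb))

        # Analyze embedding space
        self.analyze_dimensions(embeddings)
        self.identify_clusters(embeddings)
        return embeddings

    def reconstruct_model(self, embeddings):
        """Train a surrogate embedding model that regresses onto the collected vectors."""
        texts, vectors = zip(*embeddings)
        # build_student_encoder and fit_to_vectors are placeholder hooks. A real
        # sentence-transformers student is trained through fit() with
        # (DataLoader, loss) training objectives, for example an MSE loss against
        # the target vectors; fit() does not accept raw text and vector lists.
        surrogate = self.build_student_encoder()
        self.fit_to_vectors(surrogate, texts, vectors)
        return surrogate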

4. Architecture Probing

Technique: architecture
Goal: Identify model structure
Queries: 1,000-5,000

Probing Methods:
  - Input/output dimensionality
  - Attention pattern analysis
  - Layer depth estimation
  - Parameter count estimation

Detection Indicators

Query Volume:
  threshold: ">1000 queries/hour"
  indicator: Potential extraction attempt
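
The volume threshold above can be checked with a per-client sliding window. A minimal sketch, assuming each request carries a client identifier and a monotonically non-decreasing timestamp in seconds. The class name `QueryVolumeMonitor` and its parameters are hypothetical, not part of the skill.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class QueryVolumeMonitor:
    """Flag clients that exceed max_queries within a sliding time window."""

    def __init__(self, max_queries: int = 1000, window_seconds: float = 3600.0):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one query; return True if the client is now over the threshold."""
        events = self._events[client_id]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) > self.max_queries


monitor = QueryVolumeMonitor(max_queries=3, window_seconds=10.0)
flags = [monitor.record("client-a", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(flags)  # [False, False, False, True]
```

Keying on a single client identifier will miss queries spread across several identities, so in practice this signal is usually combined with the pattern-based indicators listed below.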

Query Patterns:
  - Systematic input variations
  - Boundary probing sequences
  - High-entropy random inputs

Embedding Access:
  - Bulk embedding requests
  - Sequential corpus processing

Protection Measures

┌─────────────────────┬─────────────────┬──────────────────────┐
│ Defense             │ Effectiveness   │ Impact               │
├─────────────────────┼─────────────────┼──────────────────────┤
│ Rate Limiting       │ Medium          │ Low latency overhead │
│ Query Logging       │ Detection only  │ None                 │
│ Output Perturbation │ High            │ Slight quality loss  │
│ Watermarking        │ Attribution     │ None                 │
│ Query Filtering     │ Medium          │ False positives      │
└─────────────────────┴─────────────────┴──────────────────────┘
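
Output perturbation can take several forms. A minimal sketch of one of them, assuming a classifier API that would otherwise return a full probability vector: return only the top-1 label and a coarsely rounded confidence. The function name `harden_response` and its parameters are hypothetical.

```python
from typing import Dict, Sequence, Union


def harden_response(
    probs: Sequence[float],
    labels: Sequence[str],
    decimals: int = 1,
) -> Dict[str, Union[str, float]]:
    """Return the top-1 label with a rounded confidence instead of full probabilities."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    top = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": labels[top], "confidence": round(probs[top], decimals)}


print(harden_response([0.07, 0.81, 0.12], ["cat", "dog", "bird"]))
# {'label': 'dog', 'confidence': 0.8}
```

The predicted label is unchanged, which is why the quality cost is small, while the per-class probabilities that a distillation attack relies on are no longer exposed.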

Severity Classification

CRITICAL:
  - Full model extraction achieved
  - >90% fidelity surrogate created
  - Embedding space fully mapped

HIGH:
  - Partial extraction (70-90% fidelity)
  - Architecture successfully probed
  - Key behaviors replicated

MEDIUM:
  - Limited extraction success
  - Detection mechanisms triggered

LOW:
  - Extraction attempt blocked
  - Strong rate limiting in place
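
The fidelity bands above can be turned into a small triage helper for test reports. A minimal sketch, assuming fidelity is expressed as a fraction in [0, 1] and that a blocked attempt is rated LOW regardless of fidelity. That precedence, and the function name `classify_severity`, are assumptions rather than something the listing specifies.

```python
def classify_severity(fidelity: float, blocked: bool = False) -> str:
    """Map an extraction test result onto the severity bands listed above."""
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must be a fraction between 0 and 1")
    if blocked:
        return "LOW"       # extraction attempt blocked
    if fidelity > 0.90:
        return "CRITICAL"  # >90% fidelity surrogate created
    if fidelity >= 0.70:
        return "HIGH"      # partial extraction (70-90% fidelity)
    return "MEDIUM"        # limited extraction success


for result in (0.95, 0.82, 0.40):
    print(result, classify_severity(result))
```

The qualitative criteria in the list (embedding space mapped, architecture probed, detection triggered) are not captured here and would still need a reviewer's judgment.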

Troubleshooting

Issue: Low fidelity surrogate
Solution: Increase query diversity, use soft labels

Issue: Rate limiting blocking extraction
Solution: Distribute queries, use multiple accounts

Issue: Detection alerts triggered
Solution: Slow query rate, vary patterns

Integration Points

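┌─────────────────────────────┬───────────────────────────┐
│ Component                   │ Purpose                   │
├─────────────────────────────┼───────────────────────────┤
│ Agent 04                    │ Executes extraction tests │
│ /test behavioral            │ Command interface         │
│ continuous-monitoring skill │ Detection validation      │
└─────────────────────────────┴───────────────────────────┘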

Test model extraction vulnerabilities and theft resistance.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
