clip-aware-embeddings

CLIP-Aware Image Embeddings

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "clip-aware-embeddings" with this command: npx skills add erichowens/some_claude_skills/erichowens-some-claude-skills-clip-aware-embeddings

Smart image-text matching that knows when CLIP works and when to use alternatives.

MCP Integrations

  • Firecrawl: Research latest CLIP alternatives and benchmarks

  • Hugging Face (if configured): Access model cards and documentation

Quick Decision Tree

Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → DETR, Faster R-CNN ✗
├─ Fine-grained ID (celebrities, car models) → Specialized model ✗
├─ Spatial relations ("cat left of dog") → GQA, SWIG ✗
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP ✗
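
The same tree can be applied programmatically. Below is a minimal routing sketch; the regex patterns, the suggested model names, and the CLIP fall-through are illustrative assumptions, not part of this skill's scripts.

import re

# Keyword router mirroring the decision tree above. Fine-grained ID
# (celebrities, car models) is hard to catch with keywords alone.
ROUTES = [
    (r"\b(how many|count|number of)\b", "object detection (DETR, Faster R-CNN)"),
    (r"\b(left|right|above|below|under|next to)\b", "spatial model (GQA, SWIG)"),
    (r"\b(red|blue|green|black|white)\b.+\band\b", "compositional model (DCSMs, PC-CLIP)"),
]

def route_query(query: str) -> str:
    """Suggest a model family for an image-text query; default to CLIP."""
    for pattern, target in ROUTES:
        if re.search(pattern, query, re.IGNORECASE):
            return target
    return "CLIP (semantic search / broad zero-shot classification)"

# route_query("how many cars are in this image")  -> object detection
# route_query("a beach at sunset")                -> CLIP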

When to Use This Skill

✅ Use for:

  • Semantic image search

  • Broad category classification

  • Image similarity matching

  • Zero-shot tasks on new categories

❌ Do NOT use for:

  • Counting objects in images

  • Fine-grained classification

  • Spatial understanding

  • Attribute binding

  • Negation handling

Installation

pip install transformers pillow torch sentence-transformers --break-system-packages

Validation: Run python scripts/validate_setup.py

Basic Usage

Image Search

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
image_features = model.get_image_features(**inputs)

# Search with text
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt")
text_features = model.get_text_features(**text_inputs)

# Normalize so the dot product is a cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Compute similarity (softmax over the 10 images for a single query)
similarity = (image_features @ text_features.T).softmax(dim=0)
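
To turn these scores into search results, rank the images by similarity. A short usage sketch continuing from the snippet above:

# Rank all images for the query and print the top 3
ranked = similarity.squeeze(1).argsort(descending=True).tolist()
for idx in ranked[:3]:
    print(f"img{idx}.jpg  score={similarity[idx].item():.3f}")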

Common Anti-Patterns

Anti-Pattern 1: "CLIP for Everything"

❌ Wrong:

# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results

Why wrong: CLIP's architecture collapses spatial information into a single vector. It literally cannot count.

✓ Right:

from transformers import DetrImageProcessor, DetrForObjectDetection
import torch

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Detect objects
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Convert raw outputs to thresholded, labeled detections
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

# Filter for cars and count
count = sum(1 for label in results["labels"]
            if model.config.id2label[label.item()] == "car")

How to detect: If query contains "how many", "count", or numeric questions → Use object detection

Anti-Pattern 2: Fine-Grained Classification

❌ Wrong:

# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID

Why wrong: CLIP was trained on coarse categories. Fine-grained faces, car models, and flower species require specialized models.

✓ Right:

# Use a fine-tuned face recognition model
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # then fine-tune on a celebrity dataset
)

# Or use dedicated face recognition: ArcFace, CosFace

How to detect: If query asks to distinguish between similar items in same category → Use specialized model

Anti-Pattern 3: Spatial Understanding

❌ Wrong:

# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog",
]
# Will give nearly identical scores

Why wrong: CLIP embeddings lose spatial topology; "left" and "right" are treated as interchangeable bag-of-words tokens, as the short demo below shows.
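
You can verify this with a minimal sketch that reuses the model, processor, and an image from Basic Usage:

# Score both spatial prompts against the same image; in practice
# the two probabilities come out nearly identical.
inputs = processor(
    text=["a cat to the left of a dog", "a cat to the right of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)  # e.g. roughly [0.5, 0.5]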

✓ Right:

# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
from swig_model import SpatialRelationModel  # illustrative import; adapt to your chosen model's API

model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.

How to detect: If query contains directional words (left, right, above, under, next to) → Use spatial model

Anti-Pattern 4: Attribute Binding

❌ Wrong:

prompts = [ "red car and blue truck", "blue car and red truck" ]

CLIP often gives similar scores for both

Why wrong: CLIP cannot bind attributes to objects. It sees "red, blue, car, truck" as a bag of concepts.

✓ Right - Use PC-CLIP or DCSMs:

# PC-CLIP: fine-tuned for pairwise comparisons
from pc_clip import PCCLIPModel  # illustrative import; check the upstream repo for the actual package

model = PCCLIPModel.from_pretrained("pc-clip-vit-l")

# Or use DCSMs (Dense Cosine Similarity Maps)

How to detect: If query has multiple objects with different attributes → Use compositional model

Evolution Timeline

2021: CLIP Released

  • Revolutionary: zero-shot, 400M image-text pairs

  • Widely adopted for everything

  • Limitations not yet understood

2022-2023: Limitations Discovered

  • Cannot count objects

  • Poor at fine-grained classification

  • Fails spatial reasoning

  • Can't bind attributes

2024: Alternatives Emerge

  • DCSMs: Preserve patch/token topology

  • PC-CLIP: Trained on pairwise comparisons

  • SpLiCE: Sparse interpretable embeddings

2025: Current Best Practices

  • Use CLIP for what it's good at

  • Task-specific models for limitations

  • Compositional models for complex queries

LLM Mistake: LLMs trained on 2021-2023 data will suggest CLIP for everything because its limitations weren't yet widely documented. This skill corrects that.

Validation Script

Before using CLIP, check if it's appropriate:

python scripts/validate_clip_usage.py \
    --query "your query here" \
    --check-all

Returns:

  • ✅ CLIP is appropriate

  • ❌ Use alternative (with suggestion)

Task-Specific Guidance

Image Search (CLIP ✓)

# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts

Zero-Shot Classification (CLIP ✓)

# Good: broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
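
A minimal end-to-end sketch, assuming the model and processor from Basic Usage and a PIL image named image:

# Score the image against each category and pick the best match
inputs = processor(text=categories, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(categories[probs.argmax().item()])
# Tip: prompt templates such as "a photo of a {category}" often improve accuracy.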

Object Counting (CLIP ✗)

# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md

Fine-Grained Classification (CLIP ✗)

Use specialized models

See /references/fine_grained_models.md

Spatial Reasoning (CLIP ✗)

Use spatial relation models

See /references/spatial_models.md

Troubleshooting

Issue: CLIP gives unexpected results

Check:

  • Is this a counting task? → Use object detection

  • Fine-grained classification? → Use specialized model

  • Spatial query? → Use spatial model

  • Multiple objects with attributes? → Use compositional model

Validation:

python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"

Issue: Low similarity scores

Possible causes:

  • Query too specific (CLIP works better with broad concepts)

  • Fine-grained task (not CLIP's strength)

  • Need to adjust threshold

Solution: Try a broader query or use an alternative model.
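
Before switching models, inspect the raw score spread. This is a sketch assuming the normalized features from Basic Usage; CLIP cosine similarities tend to occupy a fairly narrow band, so relative ranking is usually more informative than an absolute cutoff:

# Check the spread of cosine scores before picking a threshold
cosine = image_features @ text_features.T  # features already L2-normalized
print(f"min={cosine.min().item():.3f}  max={cosine.max().item():.3f}")

# Prefer relative ranking (top-k) over a fixed absolute threshold
top_k = cosine.squeeze(1).topk(3).indices.tolist()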

Model Selection Guide

Model            Best For                            Avoid For
CLIP ViT-L/14    Semantic search, broad categories   Counting, fine-grained, spatial
DETR             Object detection, counting          Semantic similarity
DINOv2           Fine-grained features               Text-image matching
PC-CLIP          Attribute binding, comparisons      General embedding
DCSMs            Compositional reasoning             Simple similarity

Performance Notes

CLIP models:

  • ViT-B/32: Fast, lower quality

  • ViT-L/14: Balanced (recommended)

  • ViT-g-14: Highest quality, slower

Inference time (single image, CPU):

  • ViT-B/32: ~100ms

  • ViT-L/14: ~300ms

  • ViT-g-14: ~1000ms
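
These figures are rough and hardware-dependent; a minimal sketch for measuring on your own machine, reusing model and inputs from Basic Usage:

import time
import torch

with torch.no_grad():
    model.get_image_features(**inputs)  # warm-up pass
    start = time.perf_counter()
    model.get_image_features(**inputs)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.0f} ms for the batch")  # divide by batch size for per-image cost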

Further Reading

  • /references/clip_limitations.md: Detailed analysis of CLIP's failures

  • /references/alternatives.md: When to use what model

  • /references/compositional_reasoning.md: DCSMs and PC-CLIP deep dive

  • /scripts/validate_clip_usage.py: Pre-flight validation tool

  • /scripts/diagnose_clip_issue.py: Debug unexpected results

See CHANGELOG.md for version history.

