nemo-curator

NeMo Curator - GPU-Accelerated Data Curation

Install skill "nemo-curator" with this command: npx skills add zechenzhangagi/ai-research-skills/zechenzhangagi-ai-research-skills-nemo-curator

NVIDIA's toolkit for preparing high-quality training data for LLMs.

When to use NeMo Curator

Use NeMo Curator when you are:

  • Preparing LLM training data from web scrapes (Common Crawl)

  • Deduplicating large corpora quickly (fuzzy deduplication benchmarked at 16× faster than CPU)

  • Curating multi-modal datasets (text, images, video, audio)

  • Filtering low-quality or toxic content

  • Scaling data processing across a GPU cluster

Performance:

  • 16× faster fuzzy deduplication (8TB RedPajama v2)

  • 40% lower TCO vs CPU alternatives

  • Near-linear scaling across GPU nodes

Use alternatives instead:

  • datatrove: CPU-based, open-source data processing

  • dolma: Allen AI's data toolkit

  • Ray Data: General ML data processing (no curation focus)

Quick start

Installation

# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"

Basic text curation pipeline

from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    # Keep only documents longer than 5 words (filters short docs)
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")

Data curation pipeline

Stage 1: Quality filtering

from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
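To make the three heuristics above concrete, here is a minimal pure-Python sketch of what each check might compute. The function names and the URL regex are hypothetical illustrations, not NeMo Curator's implementation; only the thresholds are taken from the snippet above.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")  # simplistic URL matcher (assumption)

def word_count_ok(text, min_words=50, max_words=100000):
    """True when the whitespace-delimited word count is within bounds."""
    n = len(text.split())
    return min_words <= n <= max_words

def repeated_line_fraction(text):
    """Fraction of non-empty lines that are copies of an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(lines)

def url_ratio(text):
    """Fraction of characters that belong to URLs."""
    if not text:
        return 0.0
    url_chars = sum(len(m) for m in URL_PATTERN.findall(text))
    return url_chars / len(text)

def keep(text):
    """Combine the three checks with the thresholds used above."""
    return (
        word_count_ok(text)
        and repeated_line_fraction(text) <= 0.3
        and url_ratio(text) <= 0.2
    )
```

Cheap checks like these are typically run before deduplication so that later, more expensive stages see fewer documents.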

Stage 2: Deduplication

Exact deduplication:

from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
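Exact deduplication boils down to hashing each document's text and keeping one document per hash. The sketch below shows that idea with the standard library only; `exact_dedup` is a hypothetical helper, and keeping the first occurrence is an assumption about tie-breaking rather than documented library behavior.

```python
import hashlib

def exact_dedup(docs, text_field="text"):
    """Keep the first document seen for each distinct text hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "hello world!"},
]
unique_ids = [d["id"] for d in exact_dedup(docs)]
```

Note that a single extra character (document 3) is enough to escape exact matching, which is why fuzzy deduplication follows.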

Fuzzy deduplication (16× faster on GPU):

from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)

deduped = fuzzy_dedup(dataset)
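The MinHash + LSH approach named above can be sketched in plain Python. This is an illustrative, CPU-only toy, not the library's GPU implementation: the character 5-gram shingling, the seeded-MD5 hash family, and the helper names are all assumptions. Only `num_hashes=260` and `num_buckets=20` come from the snippet above, which gives 13 hashes per band.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 260   # matches num_hashes above
NUM_BUCKETS = 20   # matches num_buckets above; 260 / 20 = 13 hashes per band

def shingles(text, k=5):
    """Character k-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(text):
    """One minimum per seeded hash function over the document's shingles."""
    grams = shingles(text)
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(4, "little")
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + g.encode("utf-8")).digest()[:8], "little")
            for g in grams
        ))
    return sig

def candidate_pairs(docs):
    """Documents sharing at least one identical band become candidates."""
    rows = NUM_HASHES // NUM_BUCKETS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(NUM_BUCKETS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "completely unrelated sentence about GPUs",
}
pairs = candidate_pairs(docs)
```

With 20 bands of 13 rows, the standard LSH approximation (1/b)^(1/r) puts the Jaccard similarity at which pairs start to collide reliably at roughly 0.79.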

Semantic deduplication:

from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)
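The cosine-similarity threshold above can be illustrated with a small greedy pass over precomputed embeddings. This is a sketch under stated assumptions: the two-dimensional vectors are made up for illustration, `semantic_dedup` is a hypothetical helper, and a production system would cluster embeddings rather than compare all pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_dedup(embeddings, threshold=0.8):
    """Greedy pass: drop an item whose similarity to any kept item reaches threshold."""
    kept = []
    for doc_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept

# Toy embeddings (hypothetical values): "b" points almost the same way as "a".
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
survivors = semantic_dedup(toy)
```

Raising the threshold toward 1.0 keeps more near-paraphrases; lowering it removes more aggressively.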

Stage 3: PII redaction

from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
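As a rough picture of what the "replace" action does, the sketch below swaps pattern-matchable entities for placeholder tags using regular expressions. The patterns and the `redact` helper are hypothetical and deliberately simple; entities such as PERSON and LOCATION generally need a named-entity recognition model and are not covered here.

```python
import re

# Simplified patterns (assumptions); real detectors handle many more formats.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text, entities=("EMAIL_ADDRESS", "PHONE_NUMBER")):
    """Replace each match with a tag naming the entity type."""
    for name in entities:
        text = PATTERNS[name].sub(f"<{name}>", text)
    return text

sample = "Contact jane.doe@example.com or +1 (555) 010-9999."
cleaned = redact(sample)
```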

Stage 4: Classifier filtering

from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)

GPU acceleration

GPU vs CPU performance

Operation           CPU (16 cores)   GPU (A100)   Speedup
Fuzzy dedup (8TB)   120 hours        7.5 hours    16×
Exact dedup (1TB)   8 hours          0.5 hours    16×
Quality filtering   2 hours          0.2 hours    10×

Multi-GPU scaling

from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)

Multi-modal curation

Image curation

from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)

Video curation

from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)

Audio curation

from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
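The WER filter above depends on a word error rate, which is conventionally the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance; `word_error_rate` is a hypothetical helper, and it assumes a reference transcript is available for each sample.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

At two errors over six reference words, this sample sits just above the `max_wer=0.3` cutoff used above and would be dropped.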

Common patterns

Web scrape curation (Common Crawl)

from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.modifiers import PIIRedactor
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),

    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    PIIRedactor(),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
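The execute loop above treats every stage as a callable that takes a dataset and returns a dataset. A library-free sketch of that composition pattern follows; the stage functions and `run_pipeline` are hypothetical stand-ins operating on plain lists of dicts, and recording per-stage counts is an added convenience, not something the original loop does.

```python
def min_words(n):
    """Stage factory: keep documents with at least n words."""
    def stage(docs):
        return [d for d in docs if len(d["text"].split()) >= n]
    return stage

def drop_exact_duplicates(docs):
    """Stage: keep the first document for each distinct text."""
    seen, kept = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            kept.append(d)
    return kept

def run_pipeline(docs, stages):
    """Apply each stage to the previous stage's output, recording sizes."""
    counts = [len(docs)]
    for stage in stages:
        docs = stage(docs)
        counts.append(len(docs))
    return docs, counts

docs = [{"text": "one two three"}, {"text": "one two three"}, {"text": "short"}]
result, counts = run_pipeline(docs, [min_words(2), drop_exact_duplicates])
```

Tracking how many documents survive each stage makes it easier to spot a filter that is unexpectedly aggressive.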

Distributed processing

from nemo_curator import get_client
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import FuzzyDuplicates
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()

Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2)

  • CPU (256 cores): 120 hours

  • GPU (8× A100): 7.5 hours

  • Speedup: 16×

Exact deduplication (1TB)

  • CPU (64 cores): 8 hours

  • GPU (4× A100): 0.5 hours

  • Speedup: 16×

Quality filtering (100GB)

  • CPU (32 cores): 2 hours

  • GPU (2× A100): 0.2 hours

  • Speedup: 10×

Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

  • Cost: $3.60/hour × 10 = $36/hour

  • Time for 8TB: 120 hours

  • Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

  • Cost: $32.77/hour × 2 = $65.54/hour

  • Time for 8TB: 7.5 hours

  • Total: $491.55

Savings: 89% reduction ($3,828 saved)
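The totals above follow from rate × instance count × hours. The short check below reproduces that arithmetic using only the figures quoted in this section; `run_cost` is a hypothetical helper name.

```python
def run_cost(hourly_rate, instances, hours):
    """Total cost of running `instances` machines for `hours` at a given rate."""
    return hourly_rate * instances * hours

cpu_total = run_cost(3.60, 10, 120)   # c5.18xlarge x 10, figures from above
gpu_total = run_cost(32.77, 2, 7.5)   # p4d.24xlarge x 2, figures from above
saved = cpu_total - gpu_total
reduction = saved / cpu_total

print(f"CPU ${cpu_total:,.2f}  GPU ${gpu_total:,.2f}")
print(f"Saved ${saved:,.2f} ({reduction:.0%})")
```

Swapping in current on-demand or spot prices for your region shows how sensitive the savings figure is to the hourly rates.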

Supported data formats

  • Input: Parquet, JSONL, CSV

  • Output: Parquet (recommended), JSONL

  • WebDataset: TAR archives for multi-modal
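For reference, JSONL stores one JSON object per line, which is why it streams and shards well. The sketch below round-trips two records through an in-memory buffer using only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helpers, not NeMo Curator readers or writers.

```python
import io
import json

def write_jsonl(records, handle):
    """Write one JSON object per line."""
    for rec in records:
        handle.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(handle):
    """Parse every non-blank line as a JSON object."""
    return [json.loads(line) for line in handle if line.strip()]

buffer = io.StringIO()
write_jsonl([{"id": 1, "text": "first"}, {"id": 2, "text": "second"}], buffer)
buffer.seek(0)
records = read_jsonl(buffer)
```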

Use cases

Production deployments:

  • NVIDIA used NeMo Curator to prepare Nemotron-4 training data

  • Open-source datasets curated: RedPajama v2, The Pile

References

  • Filtering Guide - 30+ quality filters, heuristics

  • Deduplication Guide - Exact, fuzzy, semantic methods
