NLP Pipeline Builder

Overview

Specialized ML pipelines for natural language processing. The builder handles text preprocessing, tokenization, transformer models (BERT, RoBERTa, GPT), fine-tuning, and deployment for production NLP systems.

NLP Tasks Supported

Text Classification

from specweave import NLPPipeline

Binary or multi-class text classification

pipeline = NLPPipeline( task="classification", classes=["positive", "negative", "neutral"], increment="0042" )

Automatically configures:

- Text preprocessing (lowercase, clean)

- Tokenization (BERT tokenizer)

- Model (BERT, RoBERTa, DistilBERT)

- Fine-tuning on your data

- Inference pipeline

pipeline.fit(train_texts, train_labels)
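Transformer classifiers train on integer class ids rather than label strings. Whether NLPPipeline does this conversion internally is not stated here, so the following is only a minimal standalone sketch of the idea. The helper names `build_label_maps` and `encode_labels` and the sample labels are hypothetical, not part of SpecWeave.

```python
def build_label_maps(classes):
    """Map each class name to an integer id, and each id back to its name."""
    label2id = {name: idx for idx, name in enumerate(classes)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def encode_labels(labels, label2id):
    """Convert string labels to ids, failing loudly on labels outside the class list."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"labels not in class list: {unknown}")
    return [label2id[label] for label in labels]


classes = ["positive", "negative", "neutral"]
label2id, id2label = build_label_maps(classes)

# Illustrative labels only (hypothetical data).
train_labels = ["positive", "neutral", "negative", "positive"]
encoded = encode_labels(train_labels, label2id)
print(encoded)  # [0, 2, 1, 0]
```

Checking for unknown labels up front tends to catch typos in the training CSV before any training time is spent.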

Named Entity Recognition (NER)

Extract entities from text

pipeline = NLPPipeline( task="ner", entities=["PERSON", "ORG", "LOC", "DATE"], increment="0042" )

Returns: [(entity_text, entity_type, start_pos, end_pos), ...]
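To make the return format concrete, here is a small sketch that consumes tuples in the `(entity_text, entity_type, start_pos, end_pos)` layout. It assumes character offsets with an exclusive end position, which the description above does not confirm. The sentence, the spans, and the helpers `check_spans` and `group_by_type` are hand-written for illustration.

```python
from collections import defaultdict

text = "Ada Lovelace joined Acme Corp in London on 1 May 2020."

# Hand-written spans in the documented tuple layout (assumed: character
# offsets, end position exclusive).
entities = [
    ("Ada Lovelace", "PERSON", 0, 12),
    ("Acme Corp", "ORG", 20, 29),
    ("London", "LOC", 33, 39),
    ("1 May 2020", "DATE", 43, 53),
]


def check_spans(text, entities):
    """Return the entities whose offsets do not reproduce their entity text."""
    return [e for e in entities if text[e[2]:e[3]] != e[0]]


def group_by_type(entities):
    """Collect entity texts under their entity type."""
    grouped = defaultdict(list)
    for entity_text, entity_type, _start, _end in entities:
        grouped[entity_type].append(entity_text)
    return dict(grouped)


print(check_spans(text, entities))  # [] when every span matches its offsets
print(group_by_type(entities))
```

A span check like this is a cheap way to find out whether offsets are character-based or token-based when wiring NER output into downstream code.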

Sentiment Analysis

Sentiment classification (specialized)

pipeline = NLPPipeline( task="sentiment", increment="0042" )

Fine-tuned for sentiment (positive/negative/neutral)

Text Generation

Generate text continuations

pipeline = NLPPipeline( task="generation", model="gpt2", increment="0042" )

Fine-tune on your domain-specific text

Best Practices for NLP

Text Preprocessing

from specweave import TextPreprocessor

preprocessor = TextPreprocessor(increment="0042")

Standard preprocessing

preprocessor.add_steps([ "lowercase", "remove_html", "remove_urls", "remove_emails", "remove_special_chars", "remove_extra_whitespace" ])

Advanced preprocessing

preprocessor.add_advanced([ "spell_correction", "lemmatization", "stopword_removal" ])
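For intuition about what the standard steps do, the sketch below approximates them with Python's standard library. It is not the TextPreprocessor implementation: the regular expressions and the `STEPS` and `preprocess` names are assumptions chosen for illustration, and real implementations usually handle more edge cases.

```python
import html
import re

# One plain function per step name; the regexes are deliberately simple.
STEPS = {
    "lowercase": str.lower,
    "remove_html": lambda t: html.unescape(re.sub(r"<[^>]+>", " ", t)),
    "remove_urls": lambda t: re.sub(r"(?:https?://|www\.)\S+", " ", t),
    "remove_emails": lambda t: re.sub(r"\S+@\S+\.\S+", " ", t),
    "remove_special_chars": lambda t: re.sub(r"[^\w\s]", " ", t),
    "remove_extra_whitespace": lambda t: re.sub(r"\s+", " ", t).strip(),
}

STANDARD_STEPS = [
    "lowercase",
    "remove_html",
    "remove_urls",
    "remove_emails",
    "remove_special_chars",
    "remove_extra_whitespace",
]


def preprocess(text, steps=STANDARD_STEPS):
    """Apply the named steps in order and return the cleaned text."""
    for name in steps:
        text = STEPS[name](text)
    return text


raw = "<p>Great phone!!  Contact support@example.com or visit https://example.com/help</p>"
print(preprocess(raw))  # great phone contact or visit
```

Order matters here: URLs and e-mail addresses have to be removed before special characters are stripped, otherwise their fragments survive as ordinary words.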

Model Selection

Text Classification:

Small datasets (<10K examples): DistilBERT (about 60% faster than BERT-base at inference)
Medium datasets (10K-100K examples): BERT-base
Large datasets (>100K examples): RoBERTa-large

NER:

General: BERT + CRF layer
Domain-specific: Fine-tune BERT on domain corpus

Sentiment:

Product reviews: DistilBERT fine-tuned on Amazon reviews
Social media: RoBERTa fine-tuned on Twitter
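The dataset-size guidance for text classification can be written as a small lookup. This is a sketch of the rule of thumb above, not a SpecWeave function: `suggest_classifier` is a hypothetical helper, and the returned strings are the usual Hugging Face checkpoint names for these models, chosen here as an assumption.

```python
def suggest_classifier(num_examples: int) -> str:
    """Return a starting checkpoint for text classification by training-set size."""
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    if num_examples < 10_000:
        return "distilbert-base-uncased"  # small datasets
    if num_examples <= 100_000:
        return "bert-base-uncased"        # medium datasets
    return "roberta-large"                # large datasets


for n in (2_500, 40_000, 250_000):
    print(n, suggest_classifier(n))
```

The thresholds are starting points; latency budgets and available GPU memory often matter as much as dataset size.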

Transfer Learning

Start from pre-trained language models

pipeline = NLPPipeline(task="classification")

Option 1: Use a pre-trained model as-is (no fine-tuning; only useful if the checkpoint was already trained for your task)

pipeline.use_pretrained("distilbert-base-uncased")

Option 2: Fine-tune on your data

pipeline.use_pretrained_and_finetune( model="bert-base-uncased", epochs=3, learning_rate=2e-5 )

Handling Long Text

For text longer than 512 tokens

pipeline = NLPPipeline(
    task="classification",
    max_length=512,
    truncation_strategy="head_and_tail",  # Keep start + end
)

Or use Longformer for long documents

pipeline.use_model("longformer") # Handles up to 4096 tokens
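A head-and-tail strategy keeps the opening and closing tokens of a long document and drops the middle. The sketch below shows one way to do that on a list of token ids. How SpecWeave splits the budget is not documented here, so the 25% head share, the two reserved special-token slots, and the function name `head_and_tail` are assumptions.

```python
def head_and_tail(token_ids, max_length=512, head_fraction=0.25, num_special_tokens=2):
    """Keep the first and last tokens of a sequence that exceeds the length budget.

    num_special_tokens reserves room for markers such as [CLS] and [SEP]
    that a tokenizer adds after truncation.
    """
    if not 0.0 <= head_fraction <= 1.0:
        raise ValueError("head_fraction must be between 0 and 1")
    budget = max_length - num_special_tokens
    if budget <= 0:
        raise ValueError("max_length leaves no room for content tokens")
    token_ids = list(token_ids)
    if len(token_ids) <= budget:
        return token_ids  # already short enough, nothing to drop
    head_len = int(budget * head_fraction)
    tail_len = budget - head_len
    # Slice from an explicit start index so tail_len == 0 yields an empty tail.
    return token_ids[:head_len] + token_ids[len(token_ids) - tail_len:]


tokens = list(range(1000))  # stand-in for real token ids
kept = head_and_tail(tokens)
print(len(kept), kept[:3], kept[-3:])  # 510 [0, 1, 2] [997, 998, 999]
```

Keeping both ends tends to help on documents such as reviews or articles, where the introduction and the conclusion carry most of the signal; for documents where the middle matters, a long-context model is the safer choice.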

Integration with SpecWeave

NLP increment structure

.specweave/increments/0042-sentiment-classifier/
├── spec.md
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── models/
│   ├── tokenizer/
│   ├── model-epoch-1/
│   ├── model-epoch-2/
│   └── model-epoch-3/
├── experiments/
│   ├── distilbert-baseline/
│   ├── bert-base-finetuned/
│   └── roberta-large/
└── deployment/
    ├── model.onnx
    └── inference.py
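If you want to lay out this skeleton by hand, for example in a test fixture, a few lines of `pathlib` are enough. This sketch creates only the top-level folders and an empty spec.md; the function `scaffold_increment` is hypothetical, and SpecWeave's own commands may create the structure differently.

```python
import tempfile
from pathlib import Path

# Top-level folders from the layout above; checkpoints and experiment runs
# are left for training to create.
INCREMENT_DIRS = ["data", "models/tokenizer", "experiments", "deployment"]


def scaffold_increment(root, increment_id, name):
    """Create the empty folder skeleton for one NLP increment and return its path."""
    base = Path(root) / ".specweave" / "increments" / f"{increment_id}-{name}"
    for sub in INCREMENT_DIRS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    (base / "spec.md").touch(exist_ok=True)
    return base


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    base = scaffold_increment(tmp, "0042", "sentiment-classifier")
    created = sorted(p.relative_to(base).as_posix() for p in base.rglob("*"))
    print(created)
```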

Commands

/ml:nlp-pipeline --task classification --model bert-base
/ml:nlp-evaluate 0042 # Evaluate on test set
/ml:nlp-deploy 0042 # Export for production

Quick setup for NLP projects with state-of-the-art transformer models.
