AI Training Data Generation
Overview
A skill for automatically generating training datasets from documents, text corpora, and structured content. It is optimized for low-resource languages, dictionary content, and domain-specific knowledge extraction.
Key Sources: Dictionary databases, Bible EPUBs (NWT), JW brochures, parallel text corpora
Capabilities
- Multi-strategy Generation: Dictionary pairs, contextual definitions, completion tasks, classification examples
- Quality Filtering: Confidence scoring, duplicate removal, and content validation
- Format Flexibility: Support for multiple AI training formats (JSONL, HuggingFace, Ollama, OpenAI)
- Language Awareness: Multi-language support with special handling for accented characters
- Scalable Processing: Generate thousands of examples from large documents
- Balance Management: Ensure dataset diversity and prevent category imbalance
- EPUB Processing: Extract parallel verses from Bible EPUBs for translation training
- Sentence Alignment: Align parallel sentences from bilingual documents
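The special handling for accented characters is not spelled out here. One common approach, offered as a sketch rather than as the skill's actual implementation, is to normalize text to Unicode NFC so that a precomposed "á" and an "a" followed by a combining acute accent compare equal before matching or deduplication. The helper names normalize_text and strip_accents are hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC form with surrounding whitespace removed."""
    return unicodedata.normalize("NFC", text).strip()


def strip_accents(text: str) -> str:
    """Return an accent-free lookup key (hypothetical helper).

    Decomposes to NFD, then drops combining marks (Unicode category 'Mn').
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    composed = "\u00e1\u00e1fengen"       # precomposed accented vowels
    decomposed = "a\u0301a\u0301fengen"   # base letters plus combining accents
    print(normalize_text(composed) == normalize_text(decomposed))
    print(strip_accents(composed))
```

An accent-free key like this is best kept alongside the original text rather than replacing it, since accents can distinguish words.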
Core Strategies
1. Dictionary Pair Extraction
Extract word-definition pairs from structured and semi-structured text.
Detection Patterns:
- Separator-based: "word – definition", "term: meaning"
- Linguistic indicators: "means", "is defined as", "refers to"
- Structural cues: Indentation, formatting, list structures
- Context analysis: Surrounding text for validation
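The separator-based and linguistic-indicator patterns above can be sketched with regular expressions. This is a minimal illustration, not the skill's own detector: the function name extract_pair, the PATTERNS table, and the confidence scores attached to each pattern are assumptions chosen for the example. Structural cues and context analysis are not covered.

```python
import re
from typing import Optional

# Each entry: (compiled pattern, illustrative confidence score).
# The scores are placeholder assumptions, not values taken from the skill.
PATTERNS = [
    # "word – definition" or "word - definition" (dash surrounded by spaces)
    (re.compile(r"^(?P<term>.+?)\s+[–-]\s+(?P<definition>.+)$"), 0.9),
    # "term: meaning"
    (re.compile(r"^(?P<term>[^:]+?):\s+(?P<definition>.+)$"), 0.9),
    # "X means Y", "X is defined as Y", "X refers to Y"
    (re.compile(
        r"^(?P<term>.+?)\s+(?:means|is defined as|refers to)\s+(?P<definition>.+)$",
        re.IGNORECASE,
    ), 0.7),
]


def extract_pair(line: str) -> Optional[dict]:
    """Return a term/definition dict for the first matching pattern, else None."""
    line = line.strip()
    for pattern, confidence in PATTERNS:
        match = pattern.match(line)
        if match:
            return {
                "term": match.group("term").strip(),
                "definition": match.group("definition").strip(),
                "confidence": confidence,
            }
    return None


if __name__ == "__main__":
    for sample in ["ááfengen – very good, excellent", "term: meaning", "plain text"]:
        print(sample, "->", extract_pair(sample))
```

Requiring whitespace around the dash keeps hyphenated words such as "well-known" from being split into a false pair.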
2. Bible EPUB Parallel Corpus Extraction
Extract aligned verse pairs from Chuukese and English NWT Bible EPUBs.
    import os
    from typing import Optional

    from ebooklib import epub
    from sklearn.model_selection import train_test_split

    from src.utils.nwt_epub_parser import NWTEpubParser


    class BibleTrainingDataExtractor:
        """
        Extract parallel training data from NWT Bible EPUBs.

        Aligns verses between Chuukese (nwt_TE.epub) and English (nwt_E.epub).
        """

        def __init__(self, chuukese_epub_path: str, english_epub_path: str):
            self.chk_parser = NWTEpubParser(chuukese_epub_path)
            self.en_parser = NWTEpubParser(english_epub_path)

        def extract_parallel_verses(
            self,
            books: Optional[list] = None,
            min_verse_length: int = 10
        ) -> list:
            """
            Extract aligned verse pairs for training.

            Args:
                books: List of book names to extract (None = all)
                min_verse_length: Minimum characters per verse

            Returns:
                List of dicts with keys 'chuukese', 'english', 'reference',
                'source', and 'confidence'
            """
            parallel_pairs = []

            # Default to every book in the Chuukese EPUB
            if books is None:
                books = self.chk_parser.get_book_list()

            for book in books:
                for chapter in self.chk_parser.get_chapters(book):
                    # Get verses from both languages
                    chk_verses = self.chk_parser.get_verses(book, chapter)
                    en_verses = self.en_parser.get_verses(book, chapter)

                    # Align by verse number; skip verses missing in English
                    for verse_num, chk_text in chk_verses.items():
                        if verse_num not in en_verses:
                            continue
                        chk_text = chk_text.strip()
                        en_text = en_verses[verse_num].strip()

                        # Filter by length, measured after stripping whitespace
                        if len(chk_text) >= min_verse_length and len(en_text) >= min_verse_length:
                            parallel_pairs.append({
                                'chuukese': chk_text,
                                'english': en_text,
                                'reference': f"{book} {chapter}:{verse_num}",
                                'source': 'bible',
                                'confidence': 0.95  # Bible translations are high-quality
                            })

            return parallel_pairs

        def export_for_helsinki_training(
            self,
            output_path: str,
            direction: str = "chk_to_en"
        ) -> dict:
            """
            Export parallel data in a format suitable for Helsinki-NLP fine-tuning.

            Args:
                output_path: Directory for the output files
                direction: 'chk_to_en' or 'en_to_chk'

            Returns:
                Statistics about exported data
            """
            if direction not in ("chk_to_en", "en_to_chk"):
                raise ValueError(f"Unknown direction: {direction!r}")

            pairs = self.extract_parallel_verses()

            # Split into train/val/test (80/10/10): hold out 20%, then halve it
            train_pairs, holdout_pairs = train_test_split(pairs, test_size=0.2, random_state=42)
            val_pairs, test_pairs = train_test_split(holdout_pairs, test_size=0.5, random_state=42)

            # Create the output directory if it does not exist yet
            os.makedirs(output_path, exist_ok=True)

            # Export one tab-separated source/target pair per line
            for split_name, split_data in [
                ('train', train_pairs),
                ('val', val_pairs),
                ('test', test_pairs)
            ]:
                split_file = os.path.join(output_path, f"{split_name}.tsv")
                with open(split_file, 'w', encoding='utf-8') as f:
                    for pair in split_data:
                        if direction == "chk_to_en":
                            source, target = pair['chuukese'], pair['english']
                        else:
                            source, target = pair['english'], pair['chuukese']
                        f.write(f"{source}\t{target}\n")

            return {
                'total_pairs': len(pairs),
                'train_size': len(train_pairs),
                'val_size': len(val_pairs),
                'test_size': len(test_pairs)
            }
3. Brochure Sentence Extraction
Extract parallel sentences from JW brochures.
    import json


    class BrochureSentenceExtractor:
        """Extract training pairs from brochure sentences."""

        def __init__(self, sentences_json_path: str):
            with open(sentences_json_path, 'r', encoding='utf-8') as f:
                self.sentences = json.load(f)

        def extract_training_pairs(self) -> list:
            """Extract and format brochure sentences for training."""
            pairs = []
            for item in self.sentences:
                # Keep only items that carry both sides of the translation
                if 'chuukese' in item and 'english' in item:
                    pairs.append({
                        'chuukese': item['chuukese'],
                        'english': item['english'],
                        'source': 'brochure',
                        # Fall back to 0.8 when the item has no confidence score
                        'confidence': item.get('confidence', 0.8)
                    })
            return pairs
4. Implementation Pattern
    from .ai_training_generator import AITrainingDataGenerator

    # Initialize generator
    generator = AITrainingDataGenerator(min_confidence=0.7)

    # Generate comprehensive training data
    training_data = generator.generate_comprehensive_training_data(
        parsed_document,
        target_count=10000
    )

    # Export in multiple formats
    files = generator.export_training_data(
        training_data,
        output_dir="training_output",
        format_type="ollama"
    )
Output Format Examples
JSONL Format (Standard)
{"input": "What does 'ááfengen' mean?", "output": "very good, excellent", "type": "dictionary_pair", "confidence": 0.95}
Ollama Format
{"prompt": "Translate this Chuukese word: ngang", "response": "fish", "system": "You are a Chuukese-English translator."}
HuggingFace Format
{"text": "### Instruction:\nWhat does 'chomong' mean in Chuukese?\n\n### Response:\nto help, assist"}
OpenAI Fine-tuning Format
{"messages": [{"role": "user", "content": "Define: kúún"}, {"role": "assistant", "content": "to go, to leave"}]}
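The four shapes above differ only in how the same question/answer pair is wrapped, so one record in the standard JSONL form can be mapped to the others. The sketch below shows one way to do that. The function names convert_record and to_jsonl are hypothetical. The format_type strings "jsonl", "huggingface", and "openai" are assumptions; only "ollama" appears in the export call earlier in this document.

```python
import json

DEFAULT_SYSTEM = "You are a Chuukese-English translator."


def convert_record(record: dict, format_type: str,
                   system_prompt: str = DEFAULT_SYSTEM) -> dict:
    """Map a standard {'input', 'output', ...} record to the requested shape."""
    question, answer = record["input"], record["output"]
    if format_type == "jsonl":
        return dict(record)
    if format_type == "ollama":
        return {"prompt": question, "response": answer, "system": system_prompt}
    if format_type == "huggingface":
        return {"text": f"### Instruction:\n{question}\n\n### Response:\n{answer}"}
    if format_type == "openai":
        return {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
    raise ValueError(f"Unknown format_type: {format_type!r}")


def to_jsonl(records: list, format_type: str) -> str:
    """Serialize records as one JSON object per line.

    ensure_ascii=False keeps accented characters readable in the output.
    """
    return "\n".join(
        json.dumps(convert_record(r, format_type), ensure_ascii=False)
        for r in records
    )


if __name__ == "__main__":
    record = {"input": "What does 'ááfengen' mean?", "output": "very good, excellent",
              "type": "dictionary_pair", "confidence": 0.95}
    for fmt in ("jsonl", "ollama", "huggingface", "openai"):
        print(to_jsonl([record], fmt))
```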
Quality Assurance
- Content validity: Does the example make linguistic sense?
- Pattern matching: Does it follow expected language patterns?
- Context appropriateness: Is the context relevant and helpful?
- Uniqueness: Avoid repetitive or duplicate content
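The uniqueness check, together with confidence-based filtering, can be automated. Linguistic validity and context appropriateness generally still need human judgment. The following is a minimal sketch under stated assumptions: examples use the "input", "output", and "confidence" fields of the standard JSONL format, and the names content_key and filter_examples are hypothetical.

```python
import hashlib
import unicodedata


def content_key(example: dict) -> str:
    """Hash the normalized input/output text so duplicates map to one key."""
    normalized = "\x1f".join(
        unicodedata.normalize("NFC", str(example.get(field, ""))).strip().lower()
        for field in ("input", "output")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_examples(examples: list, min_confidence: float = 0.7) -> list:
    """Drop low-confidence, empty, and duplicate examples, keeping first occurrences."""
    seen = set()
    kept = []
    for example in examples:
        # Examples without a confidence score are treated as 0.0
        if example.get("confidence", 0.0) < min_confidence:
            continue
        if not str(example.get("input", "")).strip():
            continue
        if not str(example.get("output", "")).strip():
            continue
        key = content_key(example)
        if key in seen:
            continue
        seen.add(key)
        kept.append(example)
    return kept


if __name__ == "__main__":
    data = [
        {"input": "Define: kúún", "output": "to go, to leave", "confidence": 0.9},
        {"input": "define: kúún ", "output": "To go, to leave", "confidence": 0.95},
        {"input": "Define: x", "output": "y", "confidence": 0.5},
    ]
    print(len(filter_examples(data)))
```

Case-folding and whitespace trimming before hashing is a design choice: it treats near-identical entries as duplicates, which may be too aggressive if case carries meaning in the target language.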
Best Practices
- Multiple validation passes: Automated and manual quality checks
- Confidence thresholds: Adjust based on use case requirements
- Human review sampling: Periodic manual validation of generated examples
- Balance management: Ensure even distribution across categories
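Balance management and human review sampling can both be approached with simple random sampling. The sketch below is one possible approach, not the skill's implementation: it caps each category at a fixed size and draws a reproducible sample for manual review. It assumes each example carries a "type" field as in the standard JSONL format. The names type_counts, balance_by_type, and review_sample are hypothetical.

```python
import random
from collections import Counter, defaultdict


def type_counts(examples: list) -> Counter:
    """Count examples per category to make imbalance visible."""
    return Counter(e.get("type", "unknown") for e in examples)


def balance_by_type(examples: list, max_per_type: int, seed: int = 42) -> list:
    """Downsample every category to at most max_per_type examples."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for example in examples:
        by_type[example.get("type", "unknown")].append(example)

    balanced = []
    for items in by_type.values():
        if len(items) > max_per_type:
            items = rng.sample(items, max_per_type)
        balanced.extend(items)
    return balanced


def review_sample(examples: list, sample_size: int, seed: int = 42) -> list:
    """Pick a reproducible random subset for manual validation."""
    rng = random.Random(seed)
    return rng.sample(examples, min(sample_size, len(examples)))


if __name__ == "__main__":
    data = (
        [{"type": "dictionary_pair", "input": f"w{i}"} for i in range(10)]
        + [{"type": "completion", "input": f"c{i}"} for i in range(2)]
    )
    print(type_counts(data))
    print(type_counts(balance_by_type(data, max_per_type=3)))
```

Capping large categories only downsamples. It does not create examples for under-represented categories, which would need additional generation.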
Dependencies
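- re: Regular expression pattern matching
- json: Data serialization and export
- hashlib: Duplicate detection and content hashing
- collections: Data structure utilities and counting
- ebooklib: EPUB reading (imported by the Bible EPUB extractor)
- scikit-learn: train_test_split for the train/val/test split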