# Data Format Specialist

## Quick Reference (30 seconds)
Advanced data format management: comprehensive data handling covering TOON encoding, JSON/YAML optimization, serialization patterns, and data validation for performance-critical applications.
**Core Capabilities:**

- **TOON Encoding**: 40-60% token reduction vs. JSON for LLM communication
- **JSON/YAML Optimization**: efficient serialization and parsing patterns
- **Data Validation**: schema validation, type checking, error handling
- **Format Conversion**: seamless transformation between data formats
- **Performance**: optimized data structures and caching strategies
- **Schema Management**: dynamic schema generation and evolution
**When to Use:**

- Optimizing data transmission to LLMs within token budgets
- High-performance serialization/deserialization
- Schema validation and data integrity
- Format conversion and data transformation
- Large dataset processing and optimization
**Quick Start:**

```python
from moai_formats_data import TOONEncoder, JSONOptimizer, DataValidator

# TOON encoding (40-60% token reduction)
encoder = TOONEncoder()
compressed = encoder.encode({"user": "John", "age": 30})
original = encoder.decode(compressed)

# Fast JSON processing
optimizer = JSONOptimizer()
fast_json = optimizer.serialize_fast(large_dataset)

# Data validation
validator = DataValidator()
schema = validator.create_schema({"name": {"type": "string", "required": True}})
result = validator.validate({"name": "John"}, schema)
```
## Implementation Guide (5 minutes)

### Core Concepts

**TOON (Token-Optimized Object Notation):**
- Custom binary-compatible format optimized for LLM token usage
- Type markers: `#` (numbers), `!` (booleans), `@` (timestamps), `~` (null) — illustrated in the sketch after this list
- 40-60% size reduction vs. JSON for typical data structures
- Lossless round-trip encoding/decoding
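The markers make scalar values self-describing without JSON's quoting overhead. As intuition only, a toy scalar encoder in the spirit of those markers might look like the following; the real `TOONEncoder` wire format is not reproduced here, so treat the exact output as an assumption:

```python
from datetime import datetime

def toy_toon_encode(value):
    """Toy scalar encoder using the TOON-style type markers listed above."""
    if value is None:
        return "~"                                 # null marker
    if isinstance(value, bool):                    # check bool before int (bool subclasses int)
        return "!" + ("1" if value else "0")       # boolean marker
    if isinstance(value, (int, float)):
        return "#" + str(value)                    # number marker
    if isinstance(value, datetime):
        return "@" + str(int(value.timestamp()))   # timestamp marker
    return str(value)                              # strings pass through unmarked

print(toy_toon_encode(30))    # "#30"
print(toy_toon_encode(True))  # "!1"
print(toy_toon_encode(None))  # "~"
```

Because the markers replace JSON's quotes, braces, and keyword literals, a typical record tokenizes into noticeably fewer chunks.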
**Performance Optimization:**

- Ultra-fast JSON processing with orjson (2-5x faster than the standard `json` module; see the sketch below)
- Streaming processing for large datasets using ijson
- Intelligent caching with LRU eviction and memory management
- Schema compression and validation optimization
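For orientation, here is a minimal sketch of the two underlying libraries as they are normally used. The `handle` callback and the assumption that `large_dataset.json` holds a top-level JSON array are illustrative:

```python
import orjson  # C-backed JSON: dumps() returns bytes, not str
import ijson   # incremental parser: never loads the whole file

# Serialize/deserialize with orjson
payload = orjson.dumps({"user": "John", "age": 30})
data = orjson.loads(payload)

def handle(item):
    """Hypothetical per-item callback."""
    print(item)

# Stream elements of a large top-level JSON array one at a time
with open("large_dataset.json", "rb") as f:
    for item in ijson.items(f, "item"):  # "item" matches each element of the root array
        handle(item)
```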
**Data Validation:**

- Type-safe validation with custom rules and patterns
- Schema evolution and migration support
- Cross-field validation and dependency checking
- Performance-optimized batch validation
### Basic Implementation

```python
from datetime import datetime

from moai_formats_data import TOONEncoder, JSONOptimizer, DataValidator

# 1. TOON encoding for LLM optimization
encoder = TOONEncoder()
data = {
    "user": {"id": 123, "name": "John", "active": True, "created": datetime.now()},
    "permissions": ["read", "write", "admin"],
}

# Encode and compare sizes
toon_data = encoder.encode(data)
original_data = encoder.decode(toon_data)

# 2. Fast JSON processing
optimizer = JSONOptimizer()

# Ultra-fast serialization
json_bytes = optimizer.serialize_fast(data)
parsed_data = optimizer.deserialize_fast(json_bytes)

# Schema compression for repeated validation
schema = {"type": "object", "properties": {"name": {"type": "string"}}}
compressed_schema = optimizer.compress_schema(schema)

# 3. Data validation
validator = DataValidator()

# Create a validation schema
user_schema = validator.create_schema({
    "username": {"type": "string", "required": True, "min_length": 3},
    "email": {"type": "email", "required": True},
    "age": {"type": "integer", "required": False, "min_value": 13},
})

# Validate data
user_data = {"username": "john_doe", "email": "john@example.com", "age": 30}
result = validator.validate(user_data, user_schema)

if result["valid"]:
    print("Data is valid!")
    sanitized = result["sanitized_data"]
else:
    print("Validation errors:", result["errors"])
```
### Common Use Cases

**API Response Optimization:**

```python
from typing import Dict

from moai_formats_data import TOONEncoder

# Optimize API responses for LLM consumption
def optimize_api_response(data: Dict) -> str:
    encoder = TOONEncoder()
    return encoder.encode(data)

# Parse optimized responses
def parse_optimized_response(toon_data: str) -> Dict:
    encoder = TOONEncoder()
    return encoder.decode(toon_data)
```
**Configuration Management:**

```python
from moai_formats_data import YAMLOptimizer

# Fast YAML configuration loading
yaml_optimizer = YAMLOptimizer()
config = yaml_optimizer.load_fast("config.yaml")

# Merge multiple configurations
merged = yaml_optimizer.merge_configs(base_config, env_config, user_config)
```
**Large Dataset Processing:**

```python
from moai_formats_data import StreamProcessor

# Stream processing for large JSON files
processor = StreamProcessor(chunk_size=8192)

# Process the file incrementally without loading it all into memory
def process_item(item):
    print(f"Processing: {item['id']}")

processor.process_json_stream("large_dataset.json", process_item)
```
## Advanced Features (10+ minutes)

### Advanced TOON Features

**Custom Type Handlers:**

```python
import uuid
from decimal import Decimal

from moai_formats_data import TOONEncoder

# Extend the TOON encoder with custom types
class CustomTOONEncoder(TOONEncoder):
    def _encode_value(self, value):
        # Handle UUID objects
        if hasattr(value, "hex") and len(value.hex) == 32:  # UUID
            return f"${value.hex}"
        # Handle Decimal objects
        elif hasattr(value, "as_tuple"):  # Decimal
            return f"&{str(value)}"
        return super()._encode_value(value)

    def _parse_value(self, s):
        # Parse custom UUIDs
        if s.startswith("$") and len(s) == 33:
            return uuid.UUID(s[1:])
        # Parse custom Decimals
        elif s.startswith("&"):
            return Decimal(s[1:])
        return super()._parse_value(s)
```
**Streaming TOON Processing:**

```python
from typing import Dict, List

from moai_formats_data import TOONEncoder

# Process TOON data in streaming mode
def stream_toon_data(data_generator):
    encoder = TOONEncoder()
    for data in data_generator:
        yield encoder.encode(data)

# Batch TOON processing
def batch_encode_toon(data_list: List[Dict], batch_size: int = 1000):
    encoder = TOONEncoder()
    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        encoded_batch = [encoder.encode(item) for item in batch]
        results.extend(encoded_batch)
    return results
```
### Advanced Validation Patterns

**Cross-Field Validation:**

```python
from typing import Dict

from moai_formats_data import DataValidator

# Validate relationships between fields
class CrossFieldValidator:
    def __init__(self):
        self.base_validator = DataValidator()

    def validate_user_data(self, data: Dict) -> Dict:
        # Base per-field validation
        schema = self.base_validator.create_schema({
            "password": {"type": "string", "required": True, "min_length": 8},
            "confirm_password": {"type": "string", "required": True},
            "email": {"type": "email", "required": True},
        })
        result = self.base_validator.validate(data, schema)

        # Cross-field validation: the two password fields must match
        if data.get("password") != data.get("confirm_password"):
            result["errors"]["password_mismatch"] = "Passwords do not match"
            result["valid"] = False

        return result
```
**Schema Evolution:**

```python
from typing import Dict

from moai_formats_data import SchemaEvolution

# Handle schema changes over time
evolution = SchemaEvolution()

# Define schema versions
v1_schema = {"name": {"type": "string"}, "age": {"type": "integer"}}
v2_schema = {
    "full_name": {"type": "string"},
    "age": {"type": "integer"},
    "email": {"type": "email"},
}

# Register schemas
evolution.register_schema("v1", v1_schema)
evolution.register_schema("v2", v2_schema)

# Add a migration function
def migrate_v1_to_v2(data: Dict) -> Dict:
    return {
        "full_name": data["name"],
        "age": data["age"],
        "email": None,  # new in v2; backfill before validating
    }

evolution.add_migration("v1", "v2", migrate_v1_to_v2)

# Migrate data
old_data = {"name": "John Doe", "age": 30}
new_data = evolution.migrate_data(old_data, "v1", "v2")
```
### Performance Optimization

**Intelligent Caching:**

```python
import time
from typing import Dict

from moai_formats_data import SmartCache

# Create a cache with memory constraints
cache = SmartCache(max_memory_mb=50, max_items=10000)

@cache.cache_result(ttl=1800)  # 30 minutes
def expensive_data_processing(data: Dict) -> Dict:
    # Simulate expensive computation
    time.sleep(0.1)
    return {"processed": True, "data": data}

# Cache statistics
print(cache.get_stats())

# Cache warming: precompute results for common queries
def warm_common_data():
    common_queries = [
        {"type": "user", "id": 1},
        {"type": "user", "id": 2},
        {"type": "config", "key": "app"},
    ]
    for query in common_queries:
        expensive_data_processing(query)

warm_common_data()
```
**Batch Processing Optimization:**

```python
from typing import Dict, List

from moai_formats_data import DataValidator

# Optimized batch validation
def validate_batch_optimized(data_list: List[Dict], schema: Dict) -> List[Dict]:
    validator = DataValidator()

    # Pre-compile schema patterns once for the whole batch
    validator._compile_schema_patterns(schema)

    # Process in batches for memory efficiency
    batch_size = 1000
    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        batch_results = [validator.validate(data, schema) for data in batch]
        results.extend(batch_results)
    return results
```
### Integration Patterns

**LLM Integration:**

```python
from typing import Dict

from moai_formats_data import TOONEncoder

# Prepare data for LLM consumption
def prepare_for_llm(data: Dict, max_tokens: int = 2000) -> str:
    encoder = TOONEncoder()
    toon_data = encoder.encode(data)

    # Rough token estimate: count whitespace-separated chunks
    estimated_tokens = len(toon_data.split())

    if estimated_tokens > max_tokens:
        # Apply a data-reduction strategy
        reduced_data = reduce_data_complexity(data, max_tokens)
        toon_data = encoder.encode(reduced_data)

    return toon_data

def reduce_data_complexity(data: Dict, max_tokens: int) -> Dict:
    """Reduce data complexity to fit a token budget."""
    encoder = TOONEncoder()

    # Keep only high-priority fields
    priority_fields = ["id", "name", "email", "status"]
    reduced = {k: v for k, v in data.items() if k in priority_fields}

    # Drop remaining fields one at a time until the budget is met
    while len(encoder.encode(reduced).split()) > max_tokens:
        if len(reduced) <= 1:
            break
        reduced.popitem()

    return reduced
```
**Database Integration:**

```python
from typing import Dict, List

from moai_formats_data import JSONOptimizer

# Optimize database query results with format conversion
def optimize_db_response(db_data: List[Dict]) -> Dict:
    # Convert database results to an optimized format
    optimizer = JSONOptimizer()

    # Compress and cache the shared schema
    common_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}
    compressed_schema = optimizer.compress_schema(common_schema)

    # Apply validation and transformation per row
    processed_data = []
    for item in db_data:
        processed_item = transform_db_item(item)  # project-specific transform
        processed_data.append(processed_item)

    return {
        "data": processed_data,
        "count": len(processed_data),
        "schema": compressed_schema,
    }
```
## Works Well With

- `moai-domain-backend` - backend data serialization and API responses
- `moai-domain-database` - database data format optimization
- `moai-integration-mcp` - MCP data serialization and transmission
- `moai-docs-generation` - documentation data formatting
- `moai-foundation-core` - core data architecture principles
## Module References

**Core Implementation Modules:**

- `modules/toon-encoding.md` - TOON encoding implementation and examples
- `modules/json-optimization.md` - high-performance JSON/YAML processing
- `modules/data-validation.md` - advanced validation and schema management
- `modules/caching-performance.md` - caching strategies and performance optimization
## Usage Examples

### CLI Usage

```bash
# Encode data to TOON format
moai-formats encode-toon --input data.json --output data.toon

# Validate data against a schema
moai-formats validate --schema schema.json --data data.json

# Convert between formats
moai-formats convert --input data.json --output data.yaml --format yaml

# Optimize JSON structure
moai-formats optimize-json --input large-data.json --output optimized.json
```
### Python API

```python
from moai_formats_data import TOONEncoder, DataValidator, JSONOptimizer

# TOON encoding
encoder = TOONEncoder()
toon_data = encoder.encode({"user": "John", "age": 30})
original_data = encoder.decode(toon_data)

# Data validation
validator = DataValidator()
schema = validator.create_schema({
    "name": {"type": "string", "required": True, "min_length": 2},
    "email": {"type": "email", "required": True},
})
result = validator.validate({"name": "John", "email": "john@example.com"}, schema)

# JSON optimization
optimizer = JSONOptimizer()
fast_json = optimizer.serialize_fast(large_dataset)
parsed_data = optimizer.deserialize_fast(fast_json)
```
## Technology Stack

**Core Libraries:**

- `orjson`: ultra-fast JSON parsing and serialization
- `PyYAML`: YAML processing with C-based loaders (see the loader sketch below)
- `ijson`: streaming JSON parser for large files
- `python-dateutil`: advanced datetime parsing
- `regex`: advanced regular-expression support
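As a reference point for two of these entries, this is the standard way to pick up PyYAML's C-accelerated loader (with a pure-Python fallback for builds without libyaml) and to parse loose timestamp strings with dateutil; the file name and timestamp are illustrative:

```python
import yaml
from dateutil import parser as date_parser

# Prefer the C-accelerated loader when PyYAML was built against libyaml
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

with open("config.yaml") as f:
    config = yaml.load(f, Loader=Loader)

# dateutil handles loose formats that datetime.fromisoformat rejects
ts = date_parser.parse("30 Nov 2025 14:32 UTC")
```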
**Performance Tools:**

- `lru_cache`: built-in memoization
- `pickle`: object serialization
- `hashlib`: hash generation for cache keys (see the sketch below)
- `functools`: function decorators and utilities
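One common way these tools combine, sketched here as a generic pattern rather than this package's internals: `lru_cache` cannot hash dict arguments, so a pickle-then-hash key works around that for flat dicts:

```python
import functools
import hashlib
import pickle

def dict_cache(func):
    """Memoize a function of one flat dict; lru_cache can't hash dict arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(data):
        # Order-independent key for flat dicts: sort items, pickle, hash
        key = hashlib.sha256(pickle.dumps(tuple(sorted(data.items())))).hexdigest()
        if key not in cache:
            cache[key] = func(data)
        return cache[key]

    return wrapper

@dict_cache
def summarize(data):
    return {"field_count": len(data)}

print(summarize({"a": 1, "b": 2}))  # computed
print(summarize({"b": 2, "a": 1}))  # served from cache (same key)
```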
**Validation Libraries:**

- `jsonschema`: JSON Schema validation
- `cerberus`: lightweight data validation
- `marshmallow`: object serialization/deserialization
- `pydantic`: data validation using Python type hints (see the sketch below)
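For comparison, minimal idiomatic calls into two of these libraries; note that `EmailStr` requires pydantic's optional email-validator extra:

```python
from jsonschema import validate
from pydantic import BaseModel, EmailStr

# jsonschema: declarative dict-based schemas (raises ValidationError on failure)
validate(
    instance={"name": "John"},
    schema={"type": "object", "properties": {"name": {"type": "string"}}},
)

# pydantic: validation driven by type hints
class User(BaseModel):
    name: str
    email: EmailStr  # needs: pip install "pydantic[email]"

user = User(name="John", email="john@example.com")
```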
**Status**: Production Ready
**Last Updated**: 2025-11-30
**Maintained by**: MoAI-ADK Data Team