Data Validation
Validate data quality, types, schemas, and consistency across systems. This skill helps ensure data accuracy, completeness, and reliability through schema validation, data quality checks, business rule validation, and anomaly detection.
When to use me
Use this skill when:
- You need to ensure data quality and consistency across systems
- You're building data pipelines and need validation rules
- You're migrating data between systems and need to validate accuracy
- You need to implement data quality monitoring and alerting
- You're designing data schemas and need validation rules
- You need to detect data anomalies and inconsistencies
- You're implementing data governance and quality standards
- You need to validate data against business rules and constraints
What I do
- Schema validation: Validate data against schemas (JSON Schema, Avro, Protobuf, XML Schema)
- Data type validation: Validate data types (strings, numbers, dates, booleans, etc.)
- Data quality checks: Check data completeness, accuracy, consistency, timeliness
- Business rule validation: Validate data against business rules and constraints
- Referential integrity validation: Validate relationships between data entities
- Anomaly detection: Detect data anomalies and outliers
- Data profiling: Profile data to understand characteristics and quality
- Data lineage validation: Validate data lineage and transformation accuracy
- Data consistency validation: Validate consistency across multiple data sources
- Data validation rule management: Manage and version validation rules
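As an illustration of the schema and type validation listed above, here is a minimal hand-rolled sketch in Python (stdlib only). The field names and rules are illustrative, not this skill's actual configuration; in practice you would validate against a formal schema such as JSON Schema or Avro rather than hand-coding checks.

```python
import re
from datetime import date

# Illustrative email pattern; real deployments should prefer a
# dedicated validator or deliverability check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record):
    """Return a list of violation messages for one customer record."""
    errors = []
    if not record.get("customer_id"):
        errors.append('missing required field "customer_id"')
    email = record.get("email")
    if email is not None and not EMAIL_RE.match(email):
        errors.append(f"invalid email format: {email!r}")
    birth = record.get("birth_date")
    if birth is not None:
        try:
            date.fromisoformat(birth)  # expects ISO 8601, e.g. 1990-12-31
        except ValueError:
            errors.append(f"invalid date format (birth_date): {birth!r}")
    return errors

print(validate_record({"email": "not-an-email", "birth_date": "31/12/1990"}))
# reports the missing customer_id, the bad email, and the bad date
```

Collecting all violations per record (rather than failing on the first) is what makes per-rule counts like those in the report below possible.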
Examples
# Validate data against JSON schema
./scripts/analyze-data-validation.sh --schema schema.json --data data.json
# Perform data quality checks
./scripts/analyze-data-validation.sh --quality-checks --source database --rules rules.yaml
# Detect data anomalies
./scripts/analyze-data-validation.sh --anomaly-detection --data data.csv --threshold 3
# Validate data consistency across sources
./scripts/analyze-data-validation.sh --consistency --sources source1.csv source2.csv
# Generate data validation report
./scripts/analyze-data-validation.sh --report --output validation-report.json
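The completeness metric that the quality checks report can be sketched as follows (stdlib only, over an in-memory list of records; a real run would stream from the CSV or database source named on the command line, and the records here are hypothetical):

```python
def completeness(records, fields):
    """Fraction of records with a non-empty value, per field."""
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in fields
    }

records = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "email": ""},
    {"name": "", "email": "grace@example.com"},
]
print(completeness(records, ["name", "email"]))
# name and email are each present in 2 of 3 records
```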
Output format
Data Validation Report
─────────────────────────────────────
Validation Date: 2025-01-15T10:30:00Z
Data Source: customer_data.csv
Total Records: 1,250,847
Validation Duration: 45s
SCHEMA VALIDATION RESULTS:
───────────────────────────
✅ Valid: 1,200,543 records (96.0%)
❌ Invalid: 50,304 records (4.0%)
Schema Violations (a record may carry more than one violation):
• Missing required field "customer_id": 12,847 records
• Invalid email format: 18,542 records
• Invalid date format (birth_date): 8,925 records
• Invalid phone number format: 10,090 records
DATA QUALITY METRICS:
──────────────────────
Completeness: 97.8%
• Customer name: 99.2% complete
• Email address: 96.5% complete
• Phone number: 95.8% complete
• Address: 92.4% complete
Accuracy: 95.3%
• Email deliverability: 94.8% valid
• Phone number validity: 96.2% valid
• Address validity: 95.0% valid
Consistency: 98.1%
• Date formats consistent: 99.2%
• Country codes consistent: 97.8%
• Currency formats consistent: 97.3%
Timeliness: 99.5%
• Data freshness: 99.8% updated within 24 hours
• Update frequency: 99.2% updated daily
BUSINESS RULE VALIDATION:
─────────────────────────
✅ Valid: 1,245,120 records (99.5%)
❌ Invalid: 5,727 records (0.5%)
Business Rule Violations:
• Customer age < 18: 1,247 records
• Invalid order amount (negative): 842 records
• Duplicate customer records: 2,150 records
• Inconsistent region-country mapping: 1,488 records
REFERENTIAL INTEGRITY VALIDATION:
─────────────────────────────────
✅ Valid: 1,248,950 references (99.8%)
❌ Invalid: 1,897 references (0.2%)
Referential Integrity Issues:
• Orphaned order records (missing customer): 892 references
• Invalid product references: 1,005 references
ANOMALY DETECTION:
──────────────────
⚠️ Detected: 3,842 anomalies (0.3%)
Anomaly Types:
• Unusual customer age distribution: 1,250 anomalies
• Abnormal order amounts: 1,892 anomalies
• Unexpected geographic distribution: 700 anomalies
DATA PROFILING SUMMARY:
───────────────────────
Numeric Fields:
• order_amount: Min=$1, Max=$15,250, Avg=$245.75, StdDev=$125.42
• customer_age: Min=18, Max=95, Avg=42.3, StdDev=12.8
Categorical Fields:
• country: 45 distinct values (US: 45%, UK: 12%, CA: 8%, ...)
• product_category: 12 distinct values (Electronics: 35%, Clothing: 25%, ...)
Date Fields:
• order_date: Range=2024-01-01 to 2025-01-15
• customer_since: Range=2018-03-15 to 2025-01-15
DATA CONSISTENCY ACROSS SOURCES:
────────────────────────────────
• Customer database vs CRM: 98.7% consistent
• Product catalog vs inventory: 97.2% consistent
• Order system vs payment gateway: 99.1% consistent
RECOMMENDATIONS:
────────────────
1. IMMEDIATE ACTION:
• Fix missing customer_id field (12,847 records)
• Clean invalid email addresses (18,542 records)
• Remove duplicate customer records (2,150 records)
2. SHORT TERM (1-2 weeks):
• Implement data validation in ingestion pipeline
• Set up data quality monitoring and alerting
• Establish data quality SLA (target: 99.5% quality)
3. MEDIUM TERM (1-3 months):
• Implement data profiling and anomaly detection
• Establish data governance framework
• Implement data lineage tracking
4. LONG TERM (3-12 months):
• Implement master data management
• Establish data quality improvement program
• Implement predictive data quality monitoring
VALIDATION RULE COVERAGE:
─────────────────────────
• Schema validation: 100% coverage
• Data type validation: 100% coverage
• Business rule validation: 85% coverage
• Referential integrity: 90% coverage
• Anomaly detection: 75% coverage
Overall Data Quality Score: 92.8/100
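The business rule checks shown in the report above can be sketched like this; the rules and field names mirror the sample violations and are illustrative only:

```python
def check_business_rules(record, seen_ids):
    """Return business-rule violations for one record.

    `seen_ids` is a mutable set of customer IDs already processed,
    used to flag duplicates across a batch.
    """
    errors = []
    age = record.get("customer_age")
    if age is not None and age < 18:
        errors.append("customer age < 18")
    amount = record.get("order_amount")
    if amount is not None and amount < 0:
        errors.append("invalid order amount (negative)")
    cid = record.get("customer_id")
    if cid in seen_ids:
        errors.append("duplicate customer record")
    seen_ids.add(cid)
    return errors

seen = set()
r1 = {"customer_id": "C1", "customer_age": 30, "order_amount": 50.0}
r2 = {"customer_id": "C1", "customer_age": 16, "order_amount": -5.0}
print(check_business_rules(r1, seen))  # no violations
print(check_business_rules(r2, seen))  # all three rules violated
```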
Notes
- Data validation should be implemented as close to data entry as possible
- Validation rules should be versioned and managed like code
- Consider performance implications of data validation, especially for large datasets
- Implement incremental validation for streaming data
- Use statistical methods for anomaly detection to reduce false positives
- Data quality metrics should be tracked over time to identify trends
- Validation failures should trigger appropriate remediation workflows
- Consider data privacy and security when validating sensitive data
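The statistical approach to anomaly detection mentioned above can be sketched with a z-score test: flag values more than `threshold` standard deviations from the mean. This is one plausible reading of the `--threshold 3` flag in the examples, not necessarily the script's actual method; the order amounts below are made up.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose |z-score| exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]

# 20 typical order amounts plus one extreme value
amounts = [245, 250, 240, 260, 255] * 4 + [15_250]
print(zscore_outliers(amounts))  # only the extreme amount is flagged
```

Note that with small samples a single outlier inflates the standard deviation and can mask itself, which is one reason robust statistics (e.g. median absolute deviation) are often preferred for reducing false positives.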