dataset-comparer

Compare two datasets to find differences, added/removed rows, changed values. Use for data validation, ETL verification, or tracking changes.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "dataset-comparer" with this command: npx skills add dkyazzentwatwa/chatgpt-skills/dkyazzentwatwa-chatgpt-skills-dataset-comparer

Dataset Comparer

Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.

Features

  • Row Comparison: Find added, removed, and matching rows
  • Value Changes: Detect changed values in matching rows
  • Column Comparison: Identify schema differences
  • Statistics: Summary of differences
  • Diff Reports: HTML, CSV, and JSON output
  • Flexible Matching: Compare by key columns or row position

Quick Start

from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")

# Compare by key column
diff = comparer.compare(key_columns=["id"])

print(f"Added rows: {diff['added_count']}")
print(f"Removed rows: {diff['removed_count']}")
print(f"Changed rows: {diff['changed_count']}")

# Generate report
comparer.generate_report("diff_report.html")

CLI Usage

# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv

# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id

# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date

# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html

# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv

# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json

# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date

# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status

API Reference

DatasetComparer Class

class DatasetComparer:
    def __init__(self)

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
    def load_dataframes(self, old_df: pd.DataFrame,
                       new_df: pd.DataFrame) -> 'DatasetComparer'

    # Comparison
    def compare(self, key_columns: list = None,
               ignore_columns: list = None,
               compare_columns: list = None) -> dict

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame
    def get_removed_rows(self) -> pd.DataFrame
    def get_changed_rows(self) -> pd.DataFrame
    def get_unchanged_rows(self) -> pd.DataFrame

    # Schema comparison
    def compare_schema(self) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def to_dataframe(self) -> pd.DataFrame
    def summary(self) -> str

Comparison Methods

Key-Based Comparison

Compare rows by matching key columns (like primary keys):

diff = comparer.compare(key_columns=["customer_id"])

# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])

Position-Based Comparison

Compare rows by their position (row number):

diff = comparer.compare()  # No keys = positional comparison

Output Format

Comparison Result

{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 900,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}

Changed Row Details

changes = comparer.get_changed_rows()

# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value

Schema Comparison

Compare column structure:

schema = comparer.compare_schema()

# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}

Filtering Options

Ignore Columns

Skip certain columns during comparison:

diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)

Compare Specific Columns

Only compare selected columns:

diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)

Report Formats

HTML Report

comparer.generate_report("diff_report.html", format="html")

Features:

  • Summary statistics
  • Interactive tables
  • Color-coded changes (green=added, red=removed, yellow=changed)
  • Schema comparison section

CSV Export

comparer.generate_report("diff_report.csv", format="csv")

Includes all differences in tabular format.

JSON Output

comparer.generate_report("diff_report.json", format="json")

Complete diff data in JSON format.

Example Workflows

Data Migration Validation

comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")

diff = comparer.compare(key_columns=["id"])

if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")

ETL Pipeline Verification

comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")

diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)

# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)

Incremental Update Detection

comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")

diff = comparer.compare(key_columns=["customer_id"])

# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")

# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")

Schema Change Detection

comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")

schema = comparer.compare_schema()

if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")

if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")

if schema["type_changes"]:
    for change in schema["type_changes"]:
        print(f"Type change: {change['column']} "
              f"({change['old_type']} -> {change['new_type']})")

Large Dataset Tips

For very large datasets:

# Compare in chunks
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")

# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])

# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")

Dependencies

  • pandas>=2.0.0
  • numpy>=1.24.0

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

ocr-document-processor

No summary provided by upstream source.

Repository SourceNeeds Review
General

text-summarizer

No summary provided by upstream source.

Repository SourceNeeds Review
General

document-converter-suite

No summary provided by upstream source.

Repository SourceNeeds Review
General

financial-calculator

No summary provided by upstream source.

Repository SourceNeeds Review