csv-data-auditor

When auditing CSV data, perform these validation checks systematically:

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "csv-data-auditor" with this command: npx skills add aig787/agpm/aig787-agpm-csv-data-auditor

CSV Data Auditor

Instructions

When auditing CSV data, perform these validation checks systematically:

  1. File Structure Validation
  • Verify the file exists and is readable

  • Check if the file is a valid CSV format

  • Ensure consistent delimiter usage

  • Verify proper quoting and escaping

  • Check for balanced quotes

  1. Header Analysis
  • Confirm headers exist (unless headerless CSV)

  • Check for duplicate header names

  • Validate header naming conventions

  • Ensure headers are descriptive and consistent

  • Check for special characters in headers

  1. Row Consistency
  • Count total rows and columns

  • Verify all rows have the same number of columns

  • Check for empty rows or rows with only whitespace

  • Identify truncated rows

  • Detect corrupted or malformed rows

  1. Data Type Validation

For each column, validate the expected data type:

Numeric Fields

  • Check for non-numeric values

  • Validate decimal point usage

  • Look for negative values where inappropriate

  • Check for extremely large/small values

Date/Time Fields

  • Verify date format consistency (ISO 8601, US, EU, etc.)

  • Check for invalid dates (e.g., February 30th)

  • Validate time zone handling

  • Look for future dates where inappropriate

Text Fields

  • Check for encoding issues

  • Look for unexpected special characters

  • Verify string length constraints

  • Check for leading/trailing whitespace

Boolean Fields

  • Validate true/false representations

  • Check for 1/0, Y/N, Yes/No consistency

  • Look for ambiguous values

  1. Data Quality Checks

Missing Values

  • Count NULL/empty values per column

  • Calculate missing value percentages

  • Identify patterns in missing data

  • Check for placeholders like "N/A", "null", "-"

Duplicates

  • Find exact duplicate rows

  • Check for duplicate IDs or keys

  • Identify near-duplicates (typos, variations)

  • Validate uniqueness constraints

Outliers

  • Identify statistical outliers in numeric columns

  • Check for values outside expected ranges

  • Look for anomalies in categorical data

  • Validate business logic constraints

Consistency Checks

  • Cross-validate related columns

  • Check referential integrity

  • Validate business rules

  • Look for logical contradictions

  1. Validation Script (Python)

Create a Python script to perform automated checks:

import pandas as pd import numpy as np from datetime import datetime

def audit_csv(file_path, delimiter=','): """Perform comprehensive CSV audit"""

# Load CSV
try:
    df = pd.read_csv(file_path, delimiter=delimiter)
except Exception as e:
    return {"error": f"Failed to load CSV: {str(e)}"}

audit_report = {
    "file_info": {
        "rows": len(df),
        "columns": len(df.columns),
        "headers": list(df.columns)
    },
    "issues": []
}

# Check for missing values
missing_counts = df.isnull().sum()
for col, count in missing_counts.items():
    if count > 0:
        percentage = (count / len(df)) * 100
        audit_report["issues"].append({
            "type": "missing_values",
            "column": col,
            "count": count,
            "percentage": round(percentage, 2)
        })

# Check for duplicates
duplicate_rows = df.duplicated().sum()
if duplicate_rows > 0:
    audit_report["issues"].append({
        "type": "duplicates",
        "count": duplicate_rows
    })

# Data type validation
for col in df.columns:
    # Check for mixed types in object columns
    if df[col].dtype == 'object':
        sample_values = df[col].dropna().head(10)
        # Add specific type checks based on column name patterns

return audit_report

7. Report Format

Generate a structured audit report:

CSV Audit Report

File Information

  • File: data.csv
  • Size: 15.2 MB
  • Rows: 50,000
  • Columns: 12
  • Headers: id, name, email, age, join_date, salary, department, ...

Issues Found

Critical Issues

  1. Missing Values:
    • email: 2,500 missing (5.0%)
    • salary: 150 missing (0.3%)

Warnings

  1. Inconsistent Date Format:

    • join_date: Mix of ISO and US formats detected
    • Examples: 2023-01-15, 01/15/2023, 15-Jan-2023
  2. Potential Outliers:

    • age: Values 0 and 150 detected
    • salary: Extremely high values > $1M

Recommendations

  1. Clean up email field - contact data source
  2. Standardize date format to ISO 8601
  3. Validate age and salary ranges
  4. Remove or investigate duplicate rows

Summary

  • Overall Quality: Good
  • Issues to Fix: 3 critical, 5 warnings
  • Estimated Fix Time: 2-3 hours
  1. Common Issues and Solutions

Encoding Problems

  • Try different encodings (UTF-8, Latin-1, Windows-1252)

  • Use chardet library to detect encoding

  • Handle BOM (Byte Order Mark) if present

Large Files

  • Process in chunks for memory efficiency

  • Use Dask for out-of-core processing

  • Sample data for quick validation

Complex CSVs

  • Handle quoted fields with embedded delimiters

  • Process multi-line records carefully

  • Validate escape character usage

Usage

  • Provide the CSV file path

  • Specify delimiter if not comma

  • Configure validation rules based on your data

  • Review the generated audit report

  • Address issues systematically

Tips

  • Always keep a backup of original data

  • Document any data transformations

  • Create validation rules based on business requirements

  • Consider using schema validation tools like Great Expectations

  • Automate regular audits for production data

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

code-reviewer

No summary provided by upstream source.

Repository SourceNeeds Review
General

commit-message-generator

No summary provided by upstream source.

Repository SourceNeeds Review
General

pdf-processor

No summary provided by upstream source.

Repository SourceNeeds Review
Security

Certcheck

SSL/TLS certificate checker and analyzer. Inspect SSL certificates for any domain, check expiration dates, verify certificate chain, detect security issues,...

Registry SourceRecently Updated
1080Profile unavailable