JSON Data File Validation Test Design

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "json-data-validation-test-design" with this command: npx skills add shimo4228/claude-code-learned-skills/shimo4228-claude-code-learned-skills-json-data-validation-test-design

Extracted: 2026-02-11

Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules.

Problem

JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:

  • Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)

  • Schema violations that the app silently swallows

  • Cross-reference mismatches (source data vs generated output)

  • Missing or duplicate entries

  • Business rule violations (e.g., correct answer not in choices)

Manual review of large files (60+ entries, 3000+ lines) is unreliable.

Solution: Layered Pytest Validation

Structure tests in layers from structural to semantic:
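The layers below all rely on shared `data` and `entries` fixtures. A minimal `conftest.py` sketch, assuming the generated file lives at `data/questions.json` with an `"items"` key (both names are assumptions for illustration):

```python
# conftest.py: shared fixtures for every test layer (sketch).
# DATA_PATH and the "items" key are assumptions, not part of the skill.
import json
from pathlib import Path

import pytest

DATA_PATH = Path("data/questions.json")  # assumed location of the generated file


@pytest.fixture(scope="module")
def data():
    # Parse the generated JSON once per test module, not once per test.
    return json.loads(DATA_PATH.read_text(encoding="utf-8"))


@pytest.fixture(scope="module")
def entries(data):
    return data["items"]
```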

Layer 1: Top-level structure

class TestTopLevelStructure:
    def test_required_fields(self, data):
        ...

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])

Layer 2: Per-entry schema validation

class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES

Layer 3: Cross-entry consistency

class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        assert len(ids) == len(set(ids))

    def test_references_resolve(self, entries, categories):
        # Every entry's category must exist in the categories list
        # (field names assumed)
        known = {c["id"] for c in categories}
        missing = [e["id"] for e in entries if e["category"] not in known]
        assert not missing, f"Unresolved category refs: {missing}"

Layer 4: Source cross-reference

class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                # message format assumed
                mismatches.append(
                    f"{e['id']}: expected {source_data[e['id']]!r}, got {e['answer']!r}"
                )
        assert not mismatches, f"{len(mismatches)} mismatches"

Layer 5: Content quality heuristics

class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues

Key Design Decisions

  • Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading the file for every test

  • Collect-all-errors pattern: accumulate issues in a list, assert at end, so one test run shows all problems

  • Graceful degradation: source cross-reference tests skip with pytest.skip() if source files are absent

  • Domain-aware thresholds: min length for text depends on the domain (e.g., 2 chars for Japanese terms like "過学習")
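The graceful-degradation decision can be sketched as follows; `SOURCE_DIR` and `parse_sources` are assumptions for illustration, not part of the original skill:

```python
# Graceful degradation for the source cross-reference layer (sketch).
# SOURCE_DIR and parse_sources are assumptions for illustration.
from pathlib import Path

import pytest

SOURCE_DIR = Path("sources")  # assumed location of the original text/CSV files


def parse_sources(path):
    # Hypothetical parser: return {entry_id: expected_answer} from the sources.
    raise NotImplementedError  # replace with real parsing logic


@pytest.fixture(scope="module")
def source_data():
    # Skip the whole cross-reference layer when sources are absent
    # (e.g., an environment that only ships the generated JSON).
    if not SOURCE_DIR.exists():
        pytest.skip(f"source files not found under {SOURCE_DIR}")
    return parse_sources(SOURCE_DIR)
```

Because the skip happens inside the fixture, every test that depends on `source_data` is skipped with one clear message instead of failing with a confusing file-not-found error.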

When to Use

  • After generating/rebuilding JSON data files from external sources

  • As a CI gate for data files that feed into apps

  • When a data file is too large for manual review

  • When data is parsed from inconsistent sources (OCR, PDF export, manual entry)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals (all in the Coding category; no summaries provided by the upstream source):

  • content-hash-cache-pattern

  • cross-source-fact-verification

  • long-document-llm-pipeline