
Data Quality Analysis Skill

Analyse raw data against the Australian Bureau of Statistics (ABS) Data Quality Framework and produce a structured quality report. Based on ABS Catalogue No. 1520.0 — ABS Data Quality Framework, May 2009.

Overview

This skill assesses data quality across the seven dimensions of the ABS Data Quality Framework:

  • Institutional Environment — Who collected the data and under what authority?

  • Relevance — Does the data measure what we need it to measure?

  • Timeliness — How current is the data?

  • Accuracy — How well does the data describe reality?

  • Coherence — Is the data internally consistent and comparable with other sources?

  • Interpretability — Can users understand and correctly use the data?

  • Accessibility — Can users find and access the data in a suitable format?

The output is a standalone markdown report that can be used directly or populated into a formal quality declaration document.

Discovery Process (CRITICAL)

Before analysing any dataset, you MUST conduct a discovery interview to understand the data and its context.

Questions to Ask

Data Source & Ownership

  • What is the dataset? (file path, URL, database table, API endpoint)

  • Who collected or produced this data?

  • Under what authority or mandate was it collected? (legislation, contract, internal policy)

  • Is there existing documentation or a data dictionary?

Purpose & Intended Use

  • What will this data be used for? (analysis, reporting, integration, modelling, decision-making)

  • Who are the intended users of this data?

  • What decisions will be informed by this data?

  • Are there specific questions this data needs to answer?

Data Context

  • What is the reference period? (what time period does the data relate to?)

  • When was the data collected or extracted?

  • How frequently is this data updated?

  • Has the data been through any prior cleaning or transformation?

Known Issues

  • Are there any known data quality concerns?

  • Have there been changes in collection methodology over time?

  • Are there known gaps or exclusions in the population/sample?

Quality Expectations

  • What level of quality is required for the intended use?

  • Are there specific dimensions of quality that are more critical than others?

  • Is there a downstream formal report or quality declaration this will feed into?

If the User Provides a File Directly

If the user provides a data file without context, proceed with what can be determined from the data itself and clearly note assumptions and unknowns in the report. Mark dimensions that cannot be assessed without external context as "Unable to assess — requires additional context" and list the specific information needed.

Analysis Workflow

Phase 1: Data Ingestion & Profiling

Load and inspect the data to establish a foundational understanding:

  • Read the data — Load the file (CSV, Excel, JSON, Parquet, database query result, etc.)

  • Schema inspection — Document column names, data types, and structure

  • Row/record counts — Total records, unique records, duplicate detection

  • Descriptive statistics — For numeric fields: min, max, mean, median, std dev, quartiles

  • Value distributions — For categorical fields: unique values, frequency counts, top values

  • Missing data analysis — Null/blank counts and percentages per column

  • Sample records — Display first few rows and any anomalous rows

Phase 1b: Column-Level Quality Scorecard

After initial profiling, compute a per-column quality scorecard. This quantifies six quality metrics for every column, producing a single table that gives the user an immediate, measurable view of data quality before the narrative dimension assessments.

The Six Column Metrics

| Metric | Definition | How to Calculate | Scale |
|---|---|---|---|
| Completeness | Proportion of non-null, non-blank values | `(total - nulls - blanks) / total * 100` | 0–100% |
| Validity | Proportion of values that conform to the expected domain (type, range, format) | `valid_values / non_null_values * 100` | 0–100% |
| Consistency | Proportion of values that match the dominant format/pattern within the column | `values_matching_dominant_pattern / non_null_values * 100` | 0–100% |
| Uniqueness | Proportion of distinct values relative to total non-null values | `unique_values / non_null_values * 100` | 0–100% |
| Timeliness | For date/time columns: recency score based on how current the most recent value is. For non-date columns: mark as N/A | See timeliness calculation below | 0–100% or N/A |
| Accuracy | Proportion of values that are plausible (not impossible, not extreme outliers) | `(non_null - impossible - outliers) / non_null * 100` | 0–100% |

Metric Calculation Details

Completeness:

```python
completeness = (len(df) - df[col].isna().sum() - (df[col] == '').sum()) / len(df) * 100
```

  • Count both NaN/None AND empty strings/whitespace-only values as incomplete

  • For numeric columns, also count sentinel values (e.g., -999, 9999) if identified as placeholders
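A minimal sketch of this completeness rule, with a hypothetical sentinel list (the actual placeholder values depend on the dataset):

```python
import pandas as pd

# Hypothetical sentinel values; real placeholders depend on the dataset
SENTINELS = [-999, 9999]

def completeness_with_sentinels(s: pd.Series) -> float:
    """Completeness treating NaN, blank strings, and sentinel values as missing."""
    n = len(s)
    nulls = s.isna().sum()
    blanks = s.astype(str).str.strip().eq('').sum() if s.dtype == object else 0
    sentinels = s.isin(SENTINELS).sum()
    return (n - nulls - blanks - sentinels) / n * 100 if n else 0.0

s = pd.Series([10, 20, -999, None, 30])
print(round(completeness_with_sentinels(s), 1))  # 60.0
```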

Validity: Assess whether values conform to the expected domain for the column type:

  • Numeric columns: values within a plausible range (not negative where impossible, within min/max bounds if known)

  • Date columns: valid date format, not in the future (unless expected), not before a reasonable minimum

  • Categorical columns: values belong to the expected set of categories (if known), or are not obviously erroneous

  • String columns: values match expected format (e.g., email regex, phone pattern, postcode format)

  • Identifier columns: values match the expected pattern (e.g., `WO-YYYY-NNNNN`)

Example for a numeric column that should be positive:

```python
valid = df[col].dropna().between(0, upper_bound).sum()
validity = valid / df[col].notna().sum() * 100
```
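Pattern validity for an identifier column could be sketched similarly; the `WO-YYYY-NNNNN` regex and sample values below are illustrative:

```python
import pandas as pd

# Illustrative pattern for IDs of the form WO-YYYY-NNNNN, e.g. WO-2024-00123
ids = pd.Series(['WO-2024-00123', 'WO-2024-7', None, 'WO-2023-00999'])
non_null = ids.dropna()
valid = non_null.str.match(r'^WO-\d{4}-\d{5}$').sum()  # anchored full match
validity = valid / len(non_null) * 100
print(f"{validity:.1f}%")  # 66.7%
```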

Consistency: Detect the dominant format/pattern and measure adherence:

  • Date columns: what percentage use the same date format? (e.g., YYYY-MM-DD vs DD/MM/YYYY vs mixed)

  • Categorical columns: case consistency (e.g., "Active" vs "active" vs "ACTIVE"), trailing spaces, encoding issues

  • Numeric columns: consistent precision/decimal places, consistent units

  • String columns: consistent casing convention, consistent delimiters

Example: check case consistency for a categorical column:

```python
values = df[col].dropna()
dominant_case = values.apply(str.title)  # or str.lower, str.upper
consistency = (values == dominant_case).sum() / len(values) * 100
```
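Date-format consistency can be sketched the same way, by classifying each raw string against candidate patterns and measuring adherence to the most common one (the two patterns shown are illustrative):

```python
import re
import pandas as pd

# Illustrative candidate formats; extend as needed for the dataset
PATTERNS = {
    'YYYY-MM-DD': re.compile(r'^\d{4}-\d{2}-\d{2}$'),
    'DD/MM/YYYY': re.compile(r'^\d{2}/\d{2}/\d{4}$'),
}

def classify(value: str) -> str:
    for name, pat in PATTERNS.items():
        if pat.match(value):
            return name
    return 'other'

dates = pd.Series(['2024-01-05', '2024-02-10', '10/03/2024'])
formats = dates.map(classify)
dominant = formats.mode().iloc[0]           # most common format
consistency = (formats == dominant).mean() * 100
print(f"{dominant}: {consistency:.1f}%")  # YYYY-MM-DD: 66.7%
```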

Uniqueness:

```python
uniqueness = df[col].nunique() / df[col].notna().sum() * 100
```

  • For identifier/key columns, uniqueness should be 100% — flag if not

  • For categorical columns, low uniqueness is expected and normal — interpret in context

  • This metric is informational; low uniqueness is not inherently bad
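For an identifier column, the calculation plus a duplicate-key listing might look like this (sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A1', 'A2', 'A2', 'A3']})
uniqueness = df['id'].nunique() / df['id'].notna().sum() * 100
print(f"{uniqueness:.1f}%")  # 75.0% — below 100%, so duplicates exist

# Show every row that shares a duplicated key, for investigation
print(df[df['id'].duplicated(keep=False)])
```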

Timeliness: Only applicable to date/datetime columns. Calculate as a recency score:

For date columns: how recent is the most recent value?

```python
if is_date_column:  # pseudocode condition
    max_date = df[col].max()
    days_since = (assessment_date - max_date).days
    # Score: 100% today, decaying linearly to 0% at 365+ days
    timeliness = max(0, 100 - (days_since / 365 * 100))
else:
    timeliness = "N/A"
```

Accuracy: Combine impossible value detection and outlier detection:

```python
non_null = df[col].notna().sum()
impossible = count_impossible_values(df[col])  # domain-specific rules
outliers = count_outliers_iqr(df[col])         # IQR method for numeric
accuracy = (non_null - impossible - outliers) / non_null * 100
```

Impossible value rules (apply based on column semantics):

  • Ages: negative or > 150

  • Dates: before 1900 or after assessment date (unless future dates are valid)

  • Percentages: < 0 or > 100

  • Costs/prices: negative (unless refunds are valid)

  • Hours: negative or exceeding plausible maximums

  • Counts: negative or non-integer
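These rules could be encoded as a small rule table; the helper name `count_impossible_values` mirrors the pseudocode above, and the rule set shown is an illustrative subset:

```python
import pandas as pd

# Illustrative rules keyed by column semantics (subset of the list above)
RULES = {
    'age':   lambda s: (s < 0) | (s > 150),
    'pct':   lambda s: (s < 0) | (s > 100),
    'count': lambda s: (s < 0) | (s % 1 != 0),
}

def count_impossible_values(s: pd.Series, kind: str) -> int:
    """Count values violating the rule for this column's semantic kind."""
    mask = RULES[kind](s.dropna())
    return int(mask.sum())

ages = pd.Series([34, -2, 151, 60, None])
print(count_impossible_values(ages, 'age'))  # 2
```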

Column Quality Scorecard Table

Present the results in this format:

| Column | Completeness | Validity | Consistency | Uniqueness | Timeliness | Accuracy | Issues |
|---|---|---|---|---|---|---|---|
| column_name | 98.5% | 99.2% | 100.0% | 45.3% | N/A | 97.8% | 23 outliers |
  • Use colour-coded indicators in the Issues column to highlight problems

  • Bold any metric below the thresholds defined below

Quality Thresholds

| Metric | HIGH (Green) | ADEQUATE (Amber) | LOW (Red) |
|---|---|---|---|
| Completeness | ≥ 95% | 80–94% | < 80% |
| Validity | ≥ 98% | 90–97% | < 90% |
| Consistency | ≥ 95% | 80–94% | < 80% |
| Uniqueness | Context-dependent | Context-dependent | Context-dependent |
| Timeliness | ≥ 80% | 50–79% | < 50% |
| Accuracy | ≥ 95% | 85–94% | < 85% |

Note on Uniqueness: Uniqueness thresholds depend on the column's role:

  • Primary key / identifier: should be 100% — anything less indicates duplicates

  • Categorical / classification: low uniqueness is expected (e.g., 5 asset classes across 12,000 rows = 0.04%)

  • Free text / measurement: moderate to high uniqueness is normal

Column Quality Summary Score

After computing all six metrics per column, derive a summary score for each column:

Column Score = weighted average of applicable metrics

Default weights (adjustable based on user priorities):

  • Completeness: 30%

  • Validity: 25%

  • Accuracy: 25%

  • Consistency: 15%

  • Uniqueness: 5% (only for identifier columns, otherwise 0% and redistribute)

  • Timeliness: 0% for non-date columns (redistribute to others)

Present an overall dataset quality score as the average of all column scores.
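The weighted average with redistribution of non-applicable weights could be sketched as follows (weights are the defaults above; `None` marks a non-applicable metric, and Timeliness is simply omitted for a non-date column):

```python
# Weighted column score with redistribution of non-applicable weights.
# Default weights from the text; a None metric drops out and the
# remaining weights are renormalised.
def column_score(metrics: dict) -> float:
    weights = {'Completeness': 0.30, 'Validity': 0.25, 'Accuracy': 0.25,
               'Consistency': 0.15, 'Uniqueness': 0.05}
    applicable = {k: w for k, w in weights.items() if metrics.get(k) is not None}
    total_w = sum(applicable.values())
    return sum(metrics[k] * w for k, w in applicable.items()) / total_w

# Non-identifier column: Uniqueness not scored, weight redistributed
m = {'Completeness': 98.0, 'Validity': 95.0, 'Accuracy': 90.0,
     'Consistency': 100.0, 'Uniqueness': None}
print(round(column_score(m), 1))  # 95.4
```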

Python Code for Column Scorecard

```python
import pandas as pd
import numpy as np


def column_quality_scorecard(df, date_cols=None, id_cols=None, assessment_date=None):
    """Compute per-column quality metrics for a DataFrame."""
    if assessment_date is None:
        assessment_date = pd.Timestamp.now()
    if date_cols is None:
        date_cols = []
    if id_cols is None:
        id_cols = []

    results = []

    for col in df.columns:
        n = len(df)
        non_null = df[col].notna().sum()
        nulls = df[col].isna().sum()

        # Completeness
        blanks = 0
        if df[col].dtype == 'object':
            blanks = df[col].fillna('').apply(lambda x: str(x).strip() == '').sum() - nulls
            blanks = max(blanks, 0)
        completeness = (n - nulls - blanks) / n * 100 if n > 0 else 0

        # Uniqueness
        uniqueness = df[col].nunique() / non_null * 100 if non_null > 0 else 0

        # Validity, Consistency, Accuracy — compute per data type
        validity = np.nan
        consistency = np.nan
        accuracy = np.nan
        timeliness = "N/A"
        issues = []

        if pd.api.types.is_numeric_dtype(df[col]):
            vals = df[col].dropna()
            # Validity: check for negative values in columns that should be positive
            # (apply domain-specific rules as needed)
            validity = 100.0  # default; override with domain rules

            # Consistency: check precision consistency
            if len(vals) > 0:
                decimals = vals.apply(lambda x: len(str(x).split('.')[-1]) if '.' in str(x) else 0)
                dominant = decimals.mode().iloc[0] if len(decimals.mode()) > 0 else 0
                consistency = (decimals == dominant).sum() / len(vals) * 100

            # Accuracy: IQR outlier detection
            if len(vals) > 0:
                q1, q3 = vals.quantile(0.25), vals.quantile(0.75)
                iqr = q3 - q1
                outlier_count = ((vals < q1 - 1.5 * iqr) | (vals > q3 + 1.5 * iqr)).sum()
                negative_count = (vals < 0).sum()
                accuracy = (len(vals) - outlier_count) / len(vals) * 100
                if outlier_count > 0:
                    issues.append(f"{outlier_count} outliers")
                if negative_count > 0:
                    issues.append(f"{negative_count} negatives")

        elif col in date_cols or pd.api.types.is_datetime64_any_dtype(df[col]):
            vals = pd.to_datetime(df[col], errors='coerce').dropna()
            valid_dates = len(vals)
            attempted = df[col].notna().sum()
            validity = valid_dates / attempted * 100 if attempted > 0 else 0
            consistency = 100.0  # after parsing; pre-parse format consistency requires string analysis

            # Timeliness
            if len(vals) > 0:
                days_since = (assessment_date - vals.max()).days
                timeliness = f"{max(0, 100 - (days_since / 365 * 100)):.0f}%"

            # Accuracy: future dates, implausible past dates
            future = (vals > assessment_date).sum()
            accuracy = (len(vals) - future) / len(vals) * 100 if len(vals) > 0 else 0
            if future > 0:
                issues.append(f"{future} future dates")

        elif df[col].dtype == 'object':
            vals = df[col].dropna()
            if len(vals) > 0:
                # Validity: non-empty strings
                valid = vals.apply(lambda x: str(x).strip() != '').sum()
                validity = valid / len(vals) * 100

                # Consistency: case consistency check
                title_match = (vals == vals.str.title()).sum()
                lower_match = (vals == vals.str.lower()).sum()
                upper_match = (vals == vals.str.upper()).sum()
                best_match = max(title_match, lower_match, upper_match)
                consistency = best_match / len(vals) * 100

                accuracy = 100.0  # default for strings; override with domain rules

        # Identifier uniqueness check
        if col in id_cols and uniqueness < 100:
            issues.append(f"Duplicate IDs ({100 - uniqueness:.1f}%)")

        if nulls > 0:
            issues.append(f"{nulls} nulls ({nulls/n*100:.1f}%)")

        results.append({
            'Column': col,
            'Completeness': f"{completeness:.1f}%",
            'Validity': f"{validity:.1f}%" if not np.isnan(validity) else "—",
            'Consistency': f"{consistency:.1f}%" if not np.isnan(consistency) else "—",
            'Uniqueness': f"{uniqueness:.1f}%",
            'Timeliness': timeliness,
            'Accuracy': f"{accuracy:.1f}%" if not np.isnan(accuracy) else "—",
            'Issues': "; ".join(issues) if issues else "—"
        })

    return pd.DataFrame(results)
```

Phase 2: Dimension-by-Dimension Assessment

Assess each of the seven ABS DQF dimensions systematically. For each dimension, provide:

  • Assessment — A clear narrative assessment

  • Evidence — Specific findings from the data that support the assessment

  • Rating — One of: HIGH | ADEQUATE | LOW | UNABLE TO ASSESS

  • Risks — Identified risks if this dimension is weak

  • Recommendations — Specific actions to improve quality in this dimension

The Seven Dimensions — Assessment Guide

Dimension 1: Institutional Environment

The institutional and organisational factors that influence the credibility and trustworthiness of the data.

Assess the following (where information is available):

| Sub-element | What to Assess |
|---|---|
| Mandate for data collection | Is there a legal, regulatory, or organisational mandate for collecting this data? What legislation, policy, or contract governs the collection? |
| Adequacy of resources | Were sufficient resources (staff, systems, budget) available to collect and maintain the data to the required standard? |
| Quality commitment | Are there documented quality assurance processes? Is there evidence of data validation, review, or quality control during collection? |
| Statistical confidentiality | Are there privacy or confidentiality considerations? Has personally identifiable information (PII) been appropriately handled? Is the data de-identified where required? |

What to look for in the data itself:

  • Metadata headers or source attribution within the file

  • Evidence of data governance (version numbers, timestamps, author fields)

  • PII exposure (names, addresses, email addresses, phone numbers, tax file numbers, etc.)

  • Signs of automated vs manual data entry (consistency patterns)

Rating guidance:

  • HIGH — Clear provenance, authoritative source, documented governance, no PII concerns

  • ADEQUATE — Source is known but governance documentation is incomplete

  • LOW — Unknown source, no governance evidence, PII exposed, or questionable authority

  • UNABLE TO ASSESS — Insufficient context provided

Dimension 2: Relevance

How well the data meets the needs of users in terms of concepts measured and populations represented.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Scope and coverage | Does the data cover the target population? Who or what is included and excluded? Are exclusions likely to cause bias? |
| Concepts and classifications | Do the variables/fields measure the right concepts? Are classifications and categories appropriate for the intended use? |
| Reference period | Does the reference period align with the user's analytical needs? |
| User needs alignment | Does the data contain the fields necessary to answer the user's questions? Are there critical gaps? |

What to look for in the data itself:

  • Column names and whether they map to the concepts the user needs

  • Coverage gaps — are expected categories, time periods, or geographic regions missing?

  • Proxy measures — are any fields indirect measures of what is actually needed?

  • Granularity — is the data at the right level of detail (e.g., individual vs aggregated)?

Rating guidance:

  • HIGH — Data directly measures required concepts, covers target population, correct reference period

  • ADEQUATE — Mostly relevant but with minor gaps in coverage or concepts

  • LOW — Significant misalignment between data content and user needs

  • UNABLE TO ASSESS — User needs not yet defined

Dimension 3: Timeliness

The delay between the reference period and data availability, and the currency of the data.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Reference period | What period does the data relate to? |
| Collection/extraction date | When was the data actually collected or extracted? |
| Lag | What is the delay between the reference period and the data being made available? |
| Update frequency | How often is the data refreshed? Is this sufficient for the intended use? |
| Currency for intended use | Is the data current enough for the decisions it will inform? |

What to look for in the data itself:

  • Date/timestamp columns — most recent and oldest values

  • File metadata — creation date, last modified date

  • Gaps in time series — missing periods that suggest delays or interruptions

  • Stale records — entries that appear outdated relative to the reference period

Rating guidance:

  • HIGH — Data is current, lag is minimal and acceptable for intended use

  • ADEQUATE — Some delay but still within acceptable bounds for the use case

  • LOW — Significant lag, data may be too outdated for reliable analysis

  • UNABLE TO ASSESS — Reference period or extraction dates unknown

Dimension 4: Accuracy

How well the data correctly describes the phenomena it was designed to measure.

This is typically the most data-intensive dimension to assess. Evaluate:

| Sub-element | What to Assess |
|---|---|
| Sampling error | If the data is a sample, what is the sampling method? What is the margin of error? Is the sample size adequate? |
| Non-sampling error | Errors from collection, processing, or coverage — not related to sampling |
| Coverage error | Are units in the target population missing from or incorrectly included in the data? |
| Non-response error | What proportion of expected responses are missing? Is there a pattern to non-response (systematic bias)? |
| Response/measurement error | Are there signs of incorrect values from respondents or measurement instruments? |
| Processing error | Are there signs of errors introduced during data entry, coding, editing, or transformation? |

What to look for in the data itself:

  • Missing values — Count and percentage per column; patterns in missingness (MCAR, MAR, MNAR)

  • Outliers — Values that fall outside expected ranges (use IQR method, z-scores, or domain knowledge)

  • Impossible values — Negative ages, future dates, percentages > 100%, etc.

  • Duplicates — Exact duplicate rows or duplicate keys

  • Internal consistency — Do related fields agree? (e.g., start date before end date, totals matching sum of parts)

  • Data type violations — Numeric fields containing text, date fields with inconsistent formats

  • Truncation/rounding — Evidence of precision loss

  • Default/sentinel values — Suspicious repeated values (e.g., 9999, 0, "N/A", "TBD")
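Two of the internal-consistency checks above, sketched on illustrative data:

```python
import pandas as pd

# Illustrative records: second row violates both checks
df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-03-01']),
    'end':   pd.to_datetime(['2024-02-01', '2024-02-15']),
    'part_a': [10, 5],
    'part_b': [20, 5],
    'total':  [30, 11],
})

# Check 1: start date must not be after end date
bad_order = (df['start'] > df['end']).sum()

# Check 2: totals must match the sum of their parts
bad_total = (df['part_a'] + df['part_b'] != df['total']).sum()

print(int(bad_order), int(bad_total))  # 1 1
```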

Rating guidance:

  • HIGH — Low missing data (<5%), no impossible values, few outliers, internally consistent

  • ADEQUATE — Some missing data (5-15%), minor inconsistencies, outliers present but explainable

  • LOW — High missing data (>15%), impossible values found, significant inconsistencies, unexplained outliers

  • UNABLE TO ASSESS — Insufficient domain knowledge to evaluate accuracy

Dimension 5: Coherence

The internal consistency of the data and its comparability with other sources over time.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Internal coherence | Are aggregations consistent? Do sub-totals sum to totals? Are cross-tabulations consistent? |
| Temporal coherence | Is the data consistent over time? Have definitions, classifications, or methods changed between periods? |
| Cross-source coherence | Does this data align with other related datasets or published statistics? |
| Standards alignment | Does the data use recognised standards, classifications, or coding schemes? (e.g., ANZSIC, ANZSCO, ISO country codes) |

What to look for in the data itself:

  • Columns that should sum to a total — do they?

  • Categories that change names or codes across records or time periods

  • Inconsistent units of measurement within the same field

  • Fields that contradict each other within the same record

  • Comparison with known benchmarks (e.g., ABS published totals, industry standards)

Rating guidance:

  • HIGH — Internally consistent, uses recognised standards, aligns with external sources

  • ADEQUATE — Minor inconsistencies, mostly uses standards but with some deviations

  • LOW — Significant internal inconsistencies, no standards used, contradicts external sources

  • UNABLE TO ASSESS — No external benchmarks available for comparison

Dimension 6: Interpretability

The availability of information needed to understand and correctly use the data.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Concepts and definitions | Are all variables clearly defined? Is there a data dictionary? |
| Classifications and coding | Are classification schemes documented? Are codes explained? |
| Collection methodology | Is the data collection method documented? (survey instrument, administrative form, sensor, etc.) |
| Data processing | Are transformations, derivations, and cleaning steps documented? |
| Known limitations | Are caveats, limitations, and appropriate use cases documented? |

What to look for in the data itself:

  • Column naming clarity — are names self-explanatory or cryptic?

  • Coded values without a lookup table (e.g., "1", "2", "3" with no label)

  • Units of measurement — are they specified? (kg, tonnes, AUD, USD, metres)

  • Presence of a header row, metadata rows, or embedded documentation

  • Consistency of naming conventions across columns

Rating guidance:

  • HIGH — Comprehensive data dictionary, clear naming, documented methodology and limitations

  • ADEQUATE — Partial documentation, mostly understandable but some fields need clarification

  • LOW — No documentation, cryptic column names, unexplained codes, unknown methodology

  • UNABLE TO ASSESS — Cannot determine without additional metadata

Dimension 7: Accessibility

The ease with which users can find and access the data in a suitable format.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Discoverability | Can users easily find that this data exists? Is it catalogued or registered? |
| Format suitability | Is the data in a machine-readable, open format? (CSV, JSON, Parquet vs proprietary formats) |
| Access conditions | Are there restrictions on access? (authentication, licensing, cost, approvals) |
| Technical accessibility | Can the data be easily loaded into common tools? (file size, encoding, structure) |
| Documentation accessibility | Is supporting documentation co-located with the data? |

What to look for in the data itself:

  • File format — open standard or proprietary?

  • File size — manageable or requires special handling?

  • Encoding — UTF-8 or problematic encoding?

  • Structure — well-structured tabular data or requires significant parsing?

  • Embedded formatting issues — merged cells in Excel, inconsistent delimiters in CSV
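The encoding check, for instance, can be sketched as follows (a bytes literal stands in for the first chunk read from a real file):

```python
# Sketch: detect non-UTF-8 bytes in a sample of the file.
# The bytes literal below stands in for f.read(10000) on a real file;
# it contains a Latin-1 'é', which is not valid UTF-8.
sample = b"name,city\nJos\xe9,Sydney\n"

try:
    sample.decode("utf-8")
    print("UTF-8 OK")
except UnicodeDecodeError as exc:
    print(f"Non-UTF-8 byte at offset {exc.start}")  # offset 13
```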

Rating guidance:

  • HIGH — Open format, easily loadable, well-structured, documentation available alongside data

  • ADEQUATE — Accessible format but minor structural issues, documentation exists but is separate

  • LOW — Proprietary format, difficult to parse, poorly structured, no documentation

  • UNABLE TO ASSESS — Data access method unknown

Report Generation

After completing the assessment, generate a standalone HTML report using the template in `references/report-template.html`. The report uses SAS-AM branding with light/dark mode, colour-coded scorecard cells, a sidebar navigation, and a print-friendly layout.

Output Format

The primary output is a single HTML file (`data-quality-report.html`) that:

  • Can be opened directly in any browser

  • Uses CDN fonts (Source Sans Pro, Source Code Pro) and icons (Font Awesome) — no build step

  • Includes light/dark mode toggle with localStorage persistence

  • Has a print button for PDF export via browser print dialog

  • Uses the SAS-AM colour system (SAS Blue #002244 / SAS Green #69BE28)

  • Colour-codes scorecard cells: green (HIGH), amber (ADEQUATE), red (LOW)

Report Structure

The HTML report follows this layout:

  • Fixed Header — Report title, theme toggle, print button

  • Sidebar Navigation — Jump links to all sections and dimensions

  • Report Title Section — Title, subtitle, metadata (dataset, source, date, intended use)

  • Executive Summary — Overall score ring, rating badge, dimension summary table

  • Data Profile — Stat cards (records, columns, duplicates, etc.) and schema table

  • Column-Level Quality Scorecard — Per-column metrics table with colour-coded cells, column summary scores, overall dataset score

  • Dimension Assessments (x7) — Cards with assessment, evidence, risks, recommendations

  • Recommendations — Grouped as Critical / Important / Minor with numbered items

  • Appendix — Profiling code block and ABS framework reference

Overall Quality Rating

Derive an overall quality rating based on the seven dimension ratings:

| Overall Rating | Criteria |
|---|---|
| HIGH | All dimensions rated HIGH or ADEQUATE, with no more than one ADEQUATE |
| ADEQUATE | Majority of dimensions rated ADEQUATE or above, no more than one LOW |
| LOW | Two or more dimensions rated LOW |
| CONDITIONAL | Quality is acceptable for some uses but not others — specify conditions |
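The HIGH/ADEQUATE/LOW rules could be derived mechanically as below; CONDITIONAL remains a judgment call and is not computable from rating counts alone:

```python
# Sketch of the overall rating rules (CONDITIONAL excluded: it is a
# judgment call about fitness for specific uses, not a count of ratings)
def overall_rating(ratings: list) -> str:
    lows = ratings.count('LOW')
    adequates = ratings.count('ADEQUATE')
    if lows >= 2:
        return 'LOW'
    if lows == 0 and adequates <= 1:
        return 'HIGH'
    return 'ADEQUATE'

print(overall_rating(['HIGH'] * 6 + ['ADEQUATE']))          # HIGH
print(overall_rating(['HIGH'] * 5 + ['ADEQUATE', 'LOW']))   # ADEQUATE
print(overall_rating(['HIGH'] * 5 + ['LOW', 'LOW']))        # LOW
```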

HTML Rating Classes

Use these CSS classes for rating badges and scorecard cells:

Rating Badge Class Cell Class

HIGH rating rating-high

cell-high

ADEQUATE rating rating-adequate

cell-adequate

LOW rating rating-low

cell-low

UNABLE TO ASSESS rating rating-unable

cell-na

Score Ring Calculation

The overall score ring in the executive summary uses an SVG circle with `stroke-dashoffset`:

```
circumference = 2 * π * 42 ≈ 264
offset = circumference * (1 - score / 100)
```

For example, a score of 96.4% → offset = 264 * (1 - 0.964) = 9.5

Set stroke colour based on rating: var(--rating-high) for HIGH, var(--rating-adequate) for ADEQUATE, etc.
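The worked example above, sketched in Python (the 96.4% score is illustrative):

```python
import math

circumference = 2 * math.pi * 42   # 263.89..., rounded to 264 in the template
score = 96.4                       # illustrative overall quality score
offset = 264 * (1 - score / 100)
print(round(offset, 1))  # 9.5
```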

Workflow Summary

Step 1: Discovery

Conduct the discovery interview to understand:

  • What the data is and where it comes from

  • What the user intends to do with it

  • What quality concerns exist

  • What documentation is available

Step 2: Data Profiling & Column Scorecard

Load and profile the data:

  • Inspect schema and data types

  • Calculate descriptive statistics

  • Analyse missing data patterns

  • Detect outliers and anomalies

  • Check for duplicates

  • Validate data types and formats

  • Compute the column-level quality scorecard (completeness, validity, consistency, uniqueness, timeliness, accuracy per column)

  • Calculate column summary scores and overall dataset quality score

Use Python (pandas, numpy) or the appropriate tool for the data format. Write analysis code in a script file or notebook for reproducibility.

Step 3: Dimension Assessment

Work through each of the seven dimensions systematically:

  • Combine data profiling results with contextual information from discovery

  • Assign a rating for each dimension

  • Document evidence, risks, and recommendations

Step 4: Report Generation

Generate the HTML report:

  • Copy the report template from references/report-template.html

  • Replace all {{PLACEHOLDER}} values with findings

  • Populate scorecard table rows with colour-coded CSS classes (`cell-high`, `cell-adequate`, `cell-low`)

  • Populate dimension cards with assessment, evidence, risks, and recommendations

  • Populate recommendation items grouped by severity

  • Calculate the score ring offset: 264 * (1 - score / 100)

  • Save as data-quality-report.html (or a user-specified filename)

Step 5: Review & Handover

  • Present the report to the user

  • Highlight critical findings and risks

  • Discuss recommendations and next steps

  • Clarify if the report will feed into a formal quality declaration

Technical Notes

Supported Data Formats

The skill should handle any data format the user provides:

  • Tabular: CSV, TSV, Excel (.xlsx, .xls), Parquet, Feather, ORC

  • Structured: JSON, JSONL, XML

  • Database: SQL query results, SQLite files

  • Other: Fixed-width files, SAS datasets (.sas7bdat), Stata (.dta), SPSS (.sav)

Python Libraries for Profiling

When writing profiling code, prefer these libraries:

```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("data.csv")  # or appropriate reader

# Schema
print(df.dtypes)
print(df.shape)

# Descriptive statistics
print(df.describe(include='all'))

# Missing data
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'missing_count': missing, 'missing_pct': missing_pct}))

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Value distributions for categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}:")
    print(df[col].value_counts().head(10))
```

Handling Large Datasets

For datasets too large to load fully into memory:

  • Use chunked reading (`pd.read_csv(..., chunksize=10000)`)

  • Profile a representative sample first

  • Note in the report that profiling was done on a sample and state the sample size
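The chunked approach can be sketched as follows; an in-memory `StringIO` stands in for a large CSV on disk, and only per-chunk aggregates are kept in memory:

```python
import io
import pandas as pd

# io.StringIO stands in for a large CSV file on disk
csv = io.StringIO("a,b\n1,\n2,5\n,6\n4,7\n")

# Accumulate null counts chunk by chunk; the full file never loads at once
totals, rows = None, 0
for chunk in pd.read_csv(csv, chunksize=2):
    rows += len(chunk)
    counts = chunk.isna().sum()
    totals = counts if totals is None else totals.add(counts, fill_value=0)

missing_pct = (totals / rows * 100).round(2)
print(missing_pct.to_dict())  # {'a': 25.0, 'b': 25.0}
```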

Content Guidelines

Australian English

Use Australian English spelling throughout:

  • analyse (not analyze)

  • organisation (not organization)

  • colour (not color)

  • programme (not program, unless referring to software)

  • licence (noun) / license (verb)

  • centre (not center)

Tone

  • Professional and objective

  • Evidence-based — every finding must be supported by data

  • Actionable — recommendations should be specific and practical

  • Proportionate — focus attention on the dimensions that matter most for the intended use

Ratings

  • Be honest about quality — do not inflate ratings

  • Use "UNABLE TO ASSESS" when genuine uncertainty exists rather than guessing

  • Always explain the rationale for each rating

  • Consider the intended use when assessing fitness for purpose

Framework Reference

Source: Australian Bureau of Statistics, ABS Data Quality Framework, Catalogue No. 1520.0, May 2009.

The ABS DQF was developed based on:

  • Statistics Canada Quality Assurance Framework

  • European Statistics Code of Practice

Key principle: Quality is fitness for purpose. The same dataset may be high quality for one use and low quality for another. Always assess quality relative to the intended use.

Trade-offs: There are inherent trade-offs between dimensions. For example:

  • Timeliness vs Accuracy — releasing data faster may reduce time for quality checks

  • Accuracy vs Accessibility — more detailed data may be harder to access due to confidentiality

  • Relevance vs Coherence — tailoring data to specific needs may reduce comparability

Checklist

Before delivering the report, verify:

  • Discovery interview completed (or assumptions clearly documented)

  • Data successfully loaded and profiled

  • Column-level quality scorecard computed with all six metrics

  • Column summary scores and overall dataset quality score calculated

  • All seven dimensions assessed with ratings

  • Evidence provided for each assessment

  • Risks identified for each dimension

  • Recommendations are specific and actionable

  • Rating summary table included in executive summary

  • Overall quality rating derived and justified

  • Australian English spelling used throughout

  • Report saved as standalone HTML file

  • Light/dark mode toggle works correctly

  • Scorecard cells are colour-coded (green/amber/red)

  • Score ring displays correct percentage and offset

  • Sidebar navigation links to all sections

  • Print layout renders cleanly (test with Ctrl+P)

  • Any analysis code saved for reproducibility
