
Data Quality Analysis Skill

Analyse raw data against the Australian Bureau of Statistics (ABS) Data Quality Framework and produce a structured quality report. Based on ABS Catalogue No. 1520.0 — ABS Data Quality Framework, May 2009.

Overview

This skill assesses data quality across the seven dimensions of the ABS Data Quality Framework:

  • Institutional Environment — Who collected the data and under what authority?

  • Relevance — Does the data measure what we need it to measure?

  • Timeliness — How current is the data?

  • Accuracy — How well does the data describe reality?

  • Coherence — Is the data internally consistent and comparable with other sources?

  • Interpretability — Can users understand and correctly use the data?

  • Accessibility — Can users find and access the data in a suitable format?

The output is a standalone markdown report that can be used directly or populated into a formal quality declaration document.

Discovery Process (CRITICAL)

Before analysing any dataset, you MUST conduct a discovery interview to understand the data and its context.

Questions to Ask

Data Source & Ownership

  • What is the dataset? (file path, URL, database table, API endpoint)

  • Who collected or produced this data?

  • Under what authority or mandate was it collected? (legislation, contract, internal policy)

  • Is there existing documentation or a data dictionary?

Purpose & Intended Use

  • What will this data be used for? (analysis, reporting, integration, modelling, decision-making)

  • Who are the intended users of this data?

  • What decisions will be informed by this data?

  • Are there specific questions this data needs to answer?

Data Context

  • What is the reference period? (what time period does the data relate to?)

  • When was the data collected or extracted?

  • How frequently is this data updated?

  • Has the data been through any prior cleaning or transformation?

Known Issues

  • Are there any known data quality concerns?

  • Have there been changes in collection methodology over time?

  • Are there known gaps or exclusions in the population/sample?

Quality Expectations

  • What level of quality is required for the intended use?

  • Are there specific dimensions of quality that are more critical than others?

  • Is there a downstream formal report or quality declaration this will feed into?

If the User Provides a File Directly

If the user provides a data file without context, proceed with what can be determined from the data itself and clearly note assumptions and unknowns in the report. Mark dimensions that cannot be assessed without external context as "Unable to assess — requires additional context" and list the specific information needed.

Analysis Workflow

Phase 1: Data Ingestion & Profiling

Load and inspect the data to establish a foundational understanding:

  • Read the data — Load the file (CSV, Excel, JSON, Parquet, database query result, etc.)

  • Schema inspection — Document column names, data types, and structure

  • Row/record counts — Total records, unique records, duplicate detection

  • Descriptive statistics — For numeric fields: min, max, mean, median, std dev, quartiles

  • Value distributions — For categorical fields: unique values, frequency counts, top values

  • Missing data analysis — Null/blank counts and percentages per column

  • Sample records — Display first few rows and any anomalous rows

Phase 1b: Column-Level Quality Scorecard

After initial profiling, compute a per-column quality scorecard. This quantifies six quality metrics for every column, producing a single table that gives the user an immediate, measurable view of data quality before the narrative dimension assessments.

The Six Column Metrics

| Metric | Definition | How to Calculate | Scale |
|---|---|---|---|
| Completeness | Proportion of non-null, non-blank values | `(total - nulls - blanks) / total * 100` | 0–100% |
| Validity | Proportion of values that conform to the expected domain (type, range, format) | `valid_values / non_null_values * 100` | 0–100% |
| Consistency | Proportion of values that match the dominant format/pattern within the column | `values_matching_dominant_pattern / non_null_values * 100` | 0–100% |
| Uniqueness | Proportion of distinct values relative to total non-null values | `unique_values / non_null_values * 100` | 0–100% |
| Timeliness | For date/time columns: recency score based on how current the most recent value is. For non-date columns: mark as N/A | See timeliness calculation below | 0–100% or N/A |
| Accuracy | Proportion of values that are plausible (not impossible, not extreme outliers) | `(non_null - impossible - outliers) / non_null * 100` | 0–100% |

Metric Calculation Details

Completeness:

```python
completeness = (len(df) - df[col].isna().sum() - (df[col] == '').sum()) / len(df) * 100
```

  • Count both NaN/None AND empty strings/whitespace-only values as incomplete

  • For numeric columns, also count sentinel values (e.g., -999, 9999) if identified as placeholders
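A minimal sketch of this completeness rule, with a hypothetical sentinel list (the actual placeholder values depend on the dataset):

```python
import pandas as pd

# Hypothetical sentinel values; real placeholders depend on the dataset
SENTINELS = [-999, 9999]

def completeness_with_sentinels(s: pd.Series) -> float:
    """Completeness treating NaN, blank strings, and sentinel values as missing."""
    n = len(s)
    nulls = s.isna().sum()
    blanks = s.astype(str).str.strip().eq('').sum() if s.dtype == object else 0
    sentinels = s.isin(SENTINELS).sum()
    return (n - nulls - blanks - sentinels) / n * 100 if n else 0.0

s = pd.Series([10, 20, -999, None, 30])
print(round(completeness_with_sentinels(s), 1))  # 60.0
```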

Validity: Assess whether values conform to the expected domain for the column type:

  • Numeric columns: values within a plausible range (not negative where impossible, within min/max bounds if known)

  • Date columns: valid date format, not in the future (unless expected), not before a reasonable minimum

  • Categorical columns: values belong to the expected set of categories (if known), or are not obviously erroneous

  • String columns: values match expected format (e.g., email regex, phone pattern, postcode format)

  • Identifier columns: values match the expected pattern (e.g., `WO-YYYY-NNNNN`)

Example for a numeric column that should be positive:

```python
valid = df[col].dropna().between(0, upper_bound).sum()
validity = valid / df[col].notna().sum() * 100
```
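Pattern validity for an identifier column could be sketched similarly; the `WO-YYYY-NNNNN` regex and sample values below are illustrative:

```python
import pandas as pd

# Illustrative pattern for IDs of the form WO-YYYY-NNNNN, e.g. WO-2024-00123
ids = pd.Series(['WO-2024-00123', 'WO-2024-7', None, 'WO-2023-00999'])
non_null = ids.dropna()
valid = non_null.str.match(r'^WO-\d{4}-\d{5}$').sum()  # anchored full match
validity = valid / len(non_null) * 100
print(f"{validity:.1f}%")  # 66.7%
```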

Consistency: Detect the dominant format/pattern and measure adherence:

  • Date columns: what percentage use the same date format? (e.g., YYYY-MM-DD vs DD/MM/YYYY vs mixed)

  • Categorical columns: case consistency (e.g., "Active" vs "active" vs "ACTIVE"), trailing spaces, encoding issues

  • Numeric columns: consistent precision/decimal places, consistent units

  • String columns: consistent casing convention, consistent delimiters

Example: check case consistency for a categorical column:

```python
values = df[col].dropna()
dominant_case = values.apply(str.title)  # or str.lower, str.upper
consistency = (values == dominant_case).sum() / len(values) * 100
```
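Date-format consistency can be sketched the same way, by classifying each raw string against candidate patterns and measuring adherence to the most common one (the two patterns shown are illustrative):

```python
import re
import pandas as pd

# Illustrative candidate formats; extend as needed for the dataset
PATTERNS = {
    'YYYY-MM-DD': re.compile(r'^\d{4}-\d{2}-\d{2}$'),
    'DD/MM/YYYY': re.compile(r'^\d{2}/\d{2}/\d{4}$'),
}

def classify(value: str) -> str:
    for name, pat in PATTERNS.items():
        if pat.match(value):
            return name
    return 'other'

dates = pd.Series(['2024-01-05', '2024-02-10', '10/03/2024'])
formats = dates.map(classify)
dominant = formats.mode().iloc[0]           # most common format
consistency = (formats == dominant).mean() * 100
print(f"{dominant}: {consistency:.1f}%")  # YYYY-MM-DD: 66.7%
```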

Uniqueness:

```python
uniqueness = df[col].nunique() / df[col].notna().sum() * 100
```

  • For identifier/key columns, uniqueness should be 100% — flag if not

  • For categorical columns, low uniqueness is expected and normal — interpret in context

  • This metric is informational; low uniqueness is not inherently bad
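For an identifier column, the calculation plus a duplicate-key listing might look like this (sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A1', 'A2', 'A2', 'A3']})
uniqueness = df['id'].nunique() / df['id'].notna().sum() * 100
print(f"{uniqueness:.1f}%")  # 75.0% — below 100%, so duplicates exist

# Show every row that shares a duplicated key, for investigation
print(df[df['id'].duplicated(keep=False)])
```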

Timeliness: Only applicable to date/datetime columns. Calculate as a recency score:

For date columns: how recent is the most recent value?

```python
if is_date_column:  # pseudocode condition
    max_date = df[col].max()
    days_since = (assessment_date - max_date).days
    # Score: 100% today, decaying linearly to 0% at 365+ days
    timeliness = max(0, 100 - (days_since / 365 * 100))
else:
    timeliness = "N/A"
```

Accuracy: Combine impossible value detection and outlier detection:

```python
non_null = df[col].notna().sum()
impossible = count_impossible_values(df[col])  # domain-specific rules
outliers = count_outliers_iqr(df[col])         # IQR method for numeric
accuracy = (non_null - impossible - outliers) / non_null * 100
```

Impossible value rules (apply based on column semantics):

  • Ages: negative or > 150

  • Dates: before 1900 or after assessment date (unless future dates are valid)

  • Percentages: < 0 or > 100

  • Costs/prices: negative (unless refunds are valid)

  • Hours: negative or exceeding plausible maximums

  • Counts: negative or non-integer
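These rules could be encoded as a small rule table; the helper name `count_impossible_values` mirrors the pseudocode above, and the rule set shown is an illustrative subset:

```python
import pandas as pd

# Illustrative rules keyed by column semantics (subset of the list above)
RULES = {
    'age':   lambda s: (s < 0) | (s > 150),
    'pct':   lambda s: (s < 0) | (s > 100),
    'count': lambda s: (s < 0) | (s % 1 != 0),
}

def count_impossible_values(s: pd.Series, kind: str) -> int:
    """Count values violating the rule for this column's semantic kind."""
    mask = RULES[kind](s.dropna())
    return int(mask.sum())

ages = pd.Series([34, -2, 151, 60, None])
print(count_impossible_values(ages, 'age'))  # 2
```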

Column Quality Scorecard Table

Present the results in this format:

| Column | Completeness | Validity | Consistency | Uniqueness | Timeliness | Accuracy | Issues |
|---|---|---|---|---|---|---|---|
| column_name | 98.5% | 99.2% | 100.0% | 45.3% | N/A | 97.8% | 23 outliers |
  • Use colour-coded indicators in the Issues column to highlight problems

  • Bold any metric below the thresholds defined below

Quality Thresholds

| Metric | HIGH (Green) | ADEQUATE (Amber) | LOW (Red) |
|---|---|---|---|
| Completeness | ≥ 95% | 80–94% | < 80% |
| Validity | ≥ 98% | 90–97% | < 90% |
| Consistency | ≥ 95% | 80–94% | < 80% |
| Uniqueness | Context-dependent | Context-dependent | Context-dependent |
| Timeliness | ≥ 80% | 50–79% | < 50% |
| Accuracy | ≥ 95% | 85–94% | < 85% |

Note on Uniqueness: Uniqueness thresholds depend on the column's role:

  • Primary key / identifier: should be 100% — anything less indicates duplicates

  • Categorical / classification: low uniqueness is expected (e.g., 5 asset classes across 12,000 rows = 0.04%)

  • Free text / measurement: moderate to high uniqueness is normal

Column Quality Summary Score

After computing all six metrics per column, derive a summary score for each column:

Column Score = weighted average of applicable metrics

Default weights (adjustable based on user priorities):

  • Completeness: 30%

  • Validity: 25%

  • Accuracy: 25%

  • Consistency: 15%

  • Uniqueness: 5% (only for identifier columns, otherwise 0% and redistribute)

  • Timeliness: 0% for non-date columns (redistribute to others)

Present an overall dataset quality score as the average of all column scores.
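The weighted average with redistribution of non-applicable weights could be sketched as follows (weights are the defaults above; `None` marks a non-applicable metric, and Timeliness is simply omitted for a non-date column):

```python
# Weighted column score with redistribution of non-applicable weights.
# Default weights from the text; a None metric drops out and the
# remaining weights are renormalised.
def column_score(metrics: dict) -> float:
    weights = {'Completeness': 0.30, 'Validity': 0.25, 'Accuracy': 0.25,
               'Consistency': 0.15, 'Uniqueness': 0.05}
    applicable = {k: w for k, w in weights.items() if metrics.get(k) is not None}
    total_w = sum(applicable.values())
    return sum(metrics[k] * w for k, w in applicable.items()) / total_w

# Non-identifier column: Uniqueness not scored, weight redistributed
m = {'Completeness': 98.0, 'Validity': 95.0, 'Accuracy': 90.0,
     'Consistency': 100.0, 'Uniqueness': None}
print(round(column_score(m), 1))  # 95.4
```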

Python Code for Column Scorecard

```python
import pandas as pd
import numpy as np


def column_quality_scorecard(df, date_cols=None, id_cols=None, assessment_date=None):
    """Compute per-column quality metrics for a DataFrame."""
    if assessment_date is None:
        assessment_date = pd.Timestamp.now()
    if date_cols is None:
        date_cols = []
    if id_cols is None:
        id_cols = []

    results = []

    for col in df.columns:
        n = len(df)
        non_null = df[col].notna().sum()
        nulls = df[col].isna().sum()

        # Completeness
        blanks = 0
        if df[col].dtype == 'object':
            blanks = df[col].fillna('').apply(lambda x: str(x).strip() == '').sum() - nulls
            blanks = max(blanks, 0)
        completeness = (n - nulls - blanks) / n * 100 if n > 0 else 0

        # Uniqueness
        uniqueness = df[col].nunique() / non_null * 100 if non_null > 0 else 0

        # Validity, Consistency, Accuracy — compute per data type
        validity = np.nan
        consistency = np.nan
        accuracy = np.nan
        timeliness = "N/A"
        issues = []

        if pd.api.types.is_numeric_dtype(df[col]):
            vals = df[col].dropna()
            # Validity: check for negative values in columns that should be positive
            # (apply domain-specific rules as needed)
            validity = 100.0  # default; override with domain rules

            # Consistency: check precision consistency
            if len(vals) > 0:
                decimals = vals.apply(lambda x: len(str(x).split('.')[-1]) if '.' in str(x) else 0)
                dominant = decimals.mode().iloc[0] if len(decimals.mode()) > 0 else 0
                consistency = (decimals == dominant).sum() / len(vals) * 100

            # Accuracy: IQR outlier detection
            if len(vals) > 0:
                q1, q3 = vals.quantile(0.25), vals.quantile(0.75)
                iqr = q3 - q1
                outlier_count = ((vals < q1 - 1.5 * iqr) | (vals > q3 + 1.5 * iqr)).sum()
                negative_count = (vals < 0).sum()
                accuracy = (len(vals) - outlier_count) / len(vals) * 100
                if outlier_count > 0:
                    issues.append(f"{outlier_count} outliers")
                if negative_count > 0:
                    issues.append(f"{negative_count} negatives")

        elif col in date_cols or pd.api.types.is_datetime64_any_dtype(df[col]):
            vals = pd.to_datetime(df[col], errors='coerce').dropna()
            valid_dates = len(vals)
            attempted = df[col].notna().sum()
            validity = valid_dates / attempted * 100 if attempted > 0 else 0
            consistency = 100.0  # after parsing; pre-parse format consistency requires string analysis

            # Timeliness
            if len(vals) > 0:
                days_since = (assessment_date - vals.max()).days
                timeliness = f"{max(0, 100 - (days_since / 365 * 100)):.0f}%"

            # Accuracy: future dates, implausible past dates
            future = (vals > assessment_date).sum()
            accuracy = (len(vals) - future) / len(vals) * 100 if len(vals) > 0 else 0
            if future > 0:
                issues.append(f"{future} future dates")

        elif df[col].dtype == 'object':
            vals = df[col].dropna()
            if len(vals) > 0:
                # Validity: non-empty strings
                valid = vals.apply(lambda x: str(x).strip() != '').sum()
                validity = valid / len(vals) * 100

                # Consistency: case consistency check
                title_match = (vals == vals.str.title()).sum()
                lower_match = (vals == vals.str.lower()).sum()
                upper_match = (vals == vals.str.upper()).sum()
                best_match = max(title_match, lower_match, upper_match)
                consistency = best_match / len(vals) * 100

                accuracy = 100.0  # default for strings; override with domain rules

        # Identifier uniqueness check
        if col in id_cols and uniqueness < 100:
            issues.append(f"Duplicate IDs ({100 - uniqueness:.1f}%)")

        if nulls > 0:
            issues.append(f"{nulls} nulls ({nulls/n*100:.1f}%)")

        results.append({
            'Column': col,
            'Completeness': f"{completeness:.1f}%",
            'Validity': f"{validity:.1f}%" if not np.isnan(validity) else "—",
            'Consistency': f"{consistency:.1f}%" if not np.isnan(consistency) else "—",
            'Uniqueness': f"{uniqueness:.1f}%",
            'Timeliness': timeliness,
            'Accuracy': f"{accuracy:.1f}%" if not np.isnan(accuracy) else "—",
            'Issues': "; ".join(issues) if issues else "—"
        })

    return pd.DataFrame(results)
```

Phase 2: Dimension-by-Dimension Assessment

Assess each of the seven ABS DQF dimensions systematically. For each dimension, provide:

  • Assessment — A clear narrative assessment

  • Evidence — Specific findings from the data that support the assessment

  • Rating — One of: HIGH | ADEQUATE | LOW | UNABLE TO ASSESS

  • Risks — Identified risks if this dimension is weak

  • Recommendations — Specific actions to improve quality in this dimension

The Seven Dimensions — Assessment Guide

Dimension 1: Institutional Environment

The institutional and organisational factors that influence the credibility and trustworthiness of the data.

Assess the following (where information is available):

| Sub-element | What to Assess |
|---|---|
| Mandate for data collection | Is there a legal, regulatory, or organisational mandate for collecting this data? What legislation, policy, or contract governs the collection? |
| Adequacy of resources | Were sufficient resources (staff, systems, budget) available to collect and maintain the data to the required standard? |
| Quality commitment | Are there documented quality assurance processes? Is there evidence of data validation, review, or quality control during collection? |
| Statistical confidentiality | Are there privacy or confidentiality considerations? Has personally identifiable information (PII) been appropriately handled? Is the data de-identified where required? |

What to look for in the data itself:

  • Metadata headers or source attribution within the file

  • Evidence of data governance (version numbers, timestamps, author fields)

  • PII exposure (names, addresses, email addresses, phone numbers, tax file numbers, etc.)

  • Signs of automated vs manual data entry (consistency patterns)

Rating guidance:

  • HIGH — Clear provenance, authoritative source, documented governance, no PII concerns

  • ADEQUATE — Source is known but governance documentation is incomplete

  • LOW — Unknown source, no governance evidence, PII exposed, or questionable authority

  • UNABLE TO ASSESS — Insufficient context provided

Dimension 2: Relevance

How well the data meets the needs of users in terms of concepts measured and populations represented.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Scope and coverage | Does the data cover the target population? Who or what is included and excluded? Are exclusions likely to cause bias? |
| Concepts and classifications | Do the variables/fields measure the right concepts? Are classifications and categories appropriate for the intended use? |
| Reference period | Does the reference period align with the user's analytical needs? |
| User needs alignment | Does the data contain the fields necessary to answer the user's questions? Are there critical gaps? |

What to look for in the data itself:

  • Column names and whether they map to the concepts the user needs

  • Coverage gaps — are expected categories, time periods, or geographic regions missing?

  • Proxy measures — are any fields indirect measures of what is actually needed?

  • Granularity — is the data at the right level of detail (e.g., individual vs aggregated)?

Rating guidance:

  • HIGH — Data directly measures required concepts, covers target population, correct reference period

  • ADEQUATE — Mostly relevant but with minor gaps in coverage or concepts

  • LOW — Significant misalignment between data content and user needs

  • UNABLE TO ASSESS — User needs not yet defined

Dimension 3: Timeliness

The delay between the reference period and data availability, and the currency of the data.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Reference period | What period does the data relate to? |
| Collection/extraction date | When was the data actually collected or extracted? |
| Lag | What is the delay between the reference period and the data being made available? |
| Update frequency | How often is the data refreshed? Is this sufficient for the intended use? |
| Currency for intended use | Is the data current enough for the decisions it will inform? |

What to look for in the data itself:

  • Date/timestamp columns — most recent and oldest values

  • File metadata — creation date, last modified date

  • Gaps in time series — missing periods that suggest delays or interruptions

  • Stale records — entries that appear outdated relative to the reference period

Rating guidance:

  • HIGH — Data is current, lag is minimal and acceptable for intended use

  • ADEQUATE — Some delay but still within acceptable bounds for the use case

  • LOW — Significant lag, data may be too outdated for reliable analysis

  • UNABLE TO ASSESS — Reference period or extraction dates unknown

Dimension 4: Accuracy

How well the data correctly describes the phenomena it was designed to measure.

This is typically the most data-intensive dimension to assess. Evaluate:

| Sub-element | What to Assess |
|---|---|
| Sampling error | If the data is a sample, what is the sampling method? What is the margin of error? Is the sample size adequate? |
| Non-sampling error | Errors from collection, processing, or coverage — not related to sampling |
| Coverage error | Are units in the target population missing from or incorrectly included in the data? |
| Non-response error | What proportion of expected responses are missing? Is there a pattern to non-response (systematic bias)? |
| Response/measurement error | Are there signs of incorrect values from respondents or measurement instruments? |
| Processing error | Are there signs of errors introduced during data entry, coding, editing, or transformation? |

What to look for in the data itself:

  • Missing values — Count and percentage per column; patterns in missingness (MCAR, MAR, MNAR)

  • Outliers — Values that fall outside expected ranges (use IQR method, z-scores, or domain knowledge)

  • Impossible values — Negative ages, future dates, percentages > 100%, etc.

  • Duplicates — Exact duplicate rows or duplicate keys

  • Internal consistency — Do related fields agree? (e.g., start date before end date, totals matching sum of parts)

  • Data type violations — Numeric fields containing text, date fields with inconsistent formats

  • Truncation/rounding — Evidence of precision loss

  • Default/sentinel values — Suspicious repeated values (e.g., 9999, 0, "N/A", "TBD")
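Two of the internal-consistency checks above, sketched on illustrative data:

```python
import pandas as pd

# Illustrative records: second row violates both checks
df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-03-01']),
    'end':   pd.to_datetime(['2024-02-01', '2024-02-15']),
    'part_a': [10, 5],
    'part_b': [20, 5],
    'total':  [30, 11],
})

# Check 1: start date must not be after end date
bad_order = (df['start'] > df['end']).sum()

# Check 2: totals must match the sum of their parts
bad_total = (df['part_a'] + df['part_b'] != df['total']).sum()

print(int(bad_order), int(bad_total))  # 1 1
```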

Rating guidance:

  • HIGH — Low missing data (<5%), no impossible values, few outliers, internally consistent

  • ADEQUATE — Some missing data (5-15%), minor inconsistencies, outliers present but explainable

  • LOW — High missing data (>15%), impossible values found, significant inconsistencies, unexplained outliers

  • UNABLE TO ASSESS — Insufficient domain knowledge to evaluate accuracy

Dimension 5: Coherence

The internal consistency of the data and its comparability with other sources over time.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Internal coherence | Are aggregations consistent? Do sub-totals sum to totals? Are cross-tabulations consistent? |
| Temporal coherence | Is the data consistent over time? Have definitions, classifications, or methods changed between periods? |
| Cross-source coherence | Does this data align with other related datasets or published statistics? |
| Standards alignment | Does the data use recognised standards, classifications, or coding schemes? (e.g., ANZSIC, ANZSCO, ISO country codes) |

What to look for in the data itself:

  • Columns that should sum to a total — do they?

  • Categories that change names or codes across records or time periods

  • Inconsistent units of measurement within the same field

  • Fields that contradict each other within the same record

  • Comparison with known benchmarks (e.g., ABS published totals, industry standards)

Rating guidance:

  • HIGH — Internally consistent, uses recognised standards, aligns with external sources

  • ADEQUATE — Minor inconsistencies, mostly uses standards but with some deviations

  • LOW — Significant internal inconsistencies, no standards used, contradicts external sources

  • UNABLE TO ASSESS — No external benchmarks available for comparison

Dimension 6: Interpretability

The availability of information needed to understand and correctly use the data.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Concepts and definitions | Are all variables clearly defined? Is there a data dictionary? |
| Classifications and coding | Are classification schemes documented? Are codes explained? |
| Collection methodology | Is the data collection method documented? (survey instrument, administrative form, sensor, etc.) |
| Data processing | Are transformations, derivations, and cleaning steps documented? |
| Known limitations | Are caveats, limitations, and appropriate use cases documented? |

What to look for in the data itself:

  • Column naming clarity — are names self-explanatory or cryptic?

  • Coded values without a lookup table (e.g., "1", "2", "3" with no label)

  • Units of measurement — are they specified? (kg, tonnes, AUD, USD, metres)

  • Presence of a header row, metadata rows, or embedded documentation

  • Consistency of naming conventions across columns

Rating guidance:

  • HIGH — Comprehensive data dictionary, clear naming, documented methodology and limitations

  • ADEQUATE — Partial documentation, mostly understandable but some fields need clarification

  • LOW — No documentation, cryptic column names, unexplained codes, unknown methodology

  • UNABLE TO ASSESS — Cannot determine without additional metadata

Dimension 7: Accessibility

The ease with which users can find and access the data in a suitable format.

Assess the following:

| Sub-element | What to Assess |
|---|---|
| Discoverability | Can users easily find that this data exists? Is it catalogued or registered? |
| Format suitability | Is the data in a machine-readable, open format? (CSV, JSON, Parquet vs proprietary formats) |
| Access conditions | Are there restrictions on access? (authentication, licensing, cost, approvals) |
| Technical accessibility | Can the data be easily loaded into common tools? (file size, encoding, structure) |
| Documentation accessibility | Is supporting documentation co-located with the data? |

What to look for in the data itself:

  • File format — open standard or proprietary?

  • File size — manageable or requires special handling?

  • Encoding — UTF-8 or problematic encoding?

  • Structure — well-structured tabular data or requires significant parsing?

  • Embedded formatting issues — merged cells in Excel, inconsistent delimiters in CSV
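The encoding check, for instance, can be sketched as follows (a bytes literal stands in for the first chunk read from a real file):

```python
# Sketch: detect non-UTF-8 bytes in a sample of the file.
# The bytes literal below stands in for f.read(10000) on a real file;
# it contains a Latin-1 'é', which is not valid UTF-8.
sample = b"name,city\nJos\xe9,Sydney\n"

try:
    sample.decode("utf-8")
    print("UTF-8 OK")
except UnicodeDecodeError as exc:
    print(f"Non-UTF-8 byte at offset {exc.start}")  # offset 13
```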

Rating guidance:

  • HIGH — Open format, easily loadable, well-structured, documentation available alongside data

  • ADEQUATE — Accessible format but minor structural issues, documentation exists but is separate

  • LOW — Proprietary format, difficult to parse, poorly structured, no documentation

  • UNABLE TO ASSESS — Data access method unknown

Report Generation

After completing the assessment, generate a standalone HTML report using the template in `references/report-template.html`. The report uses SAS-AM branding with light/dark mode, colour-coded scorecard cells, a sidebar navigation, and a print-friendly layout.

Output Format

The primary output is a single HTML file (`data-quality-report.html`) that:

  • Can be opened directly in any browser

  • Uses CDN fonts (Source Sans Pro, Source Code Pro) and icons (Font Awesome) — no build step

  • Includes light/dark mode toggle with localStorage persistence

  • Has a print button for PDF export via browser print dialog

  • Uses the SAS-AM colour system (SAS Blue #002244 / SAS Green #69BE28)

  • Colour-codes scorecard cells: green (HIGH), amber (ADEQUATE), red (LOW)

Report Structure

The HTML report follows this layout:

  • Fixed Header — Report title, theme toggle, print button

  • Sidebar Navigation — Jump links to all sections and dimensions

  • Report Title Section — Title, subtitle, metadata (dataset, source, date, intended use)

  • Executive Summary — Overall score ring, rating badge, dimension summary table

  • Data Profile — Stat cards (records, columns, duplicates, etc.) and schema table

  • Column-Level Quality Scorecard — Per-column metrics table with colour-coded cells, column summary scores, overall dataset score

  • Dimension Assessments (x7) — Cards with assessment, evidence, risks, recommendations

  • Recommendations — Grouped as Critical / Important / Minor with numbered items

  • Appendix — Profiling code block and ABS framework reference

Overall Quality Rating

Derive an overall quality rating based on the seven dimension ratings:

| Overall Rating | Criteria |
|---|---|
| HIGH | All dimensions rated HIGH or ADEQUATE, with no more than one ADEQUATE |
| ADEQUATE | Majority of dimensions rated ADEQUATE or above, no more than one LOW |
| LOW | Two or more dimensions rated LOW |
| CONDITIONAL | Quality is acceptable for some uses but not others — specify conditions |
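The HIGH/ADEQUATE/LOW rules could be derived mechanically as below; CONDITIONAL remains a judgment call and is not computable from rating counts alone:

```python
# Sketch of the overall rating rules (CONDITIONAL excluded: it is a
# judgment call about fitness for specific uses, not a count of ratings)
def overall_rating(ratings: list) -> str:
    lows = ratings.count('LOW')
    adequates = ratings.count('ADEQUATE')
    if lows >= 2:
        return 'LOW'
    if lows == 0 and adequates <= 1:
        return 'HIGH'
    return 'ADEQUATE'

print(overall_rating(['HIGH'] * 6 + ['ADEQUATE']))          # HIGH
print(overall_rating(['HIGH'] * 5 + ['ADEQUATE', 'LOW']))   # ADEQUATE
print(overall_rating(['HIGH'] * 5 + ['LOW', 'LOW']))        # LOW
```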

HTML Rating Classes

Use these CSS classes for rating badges and scorecard cells:

Rating Badge Class Cell Class

HIGH rating rating-high

cell-high

ADEQUATE rating rating-adequate

cell-adequate

LOW rating rating-low

cell-low

UNABLE TO ASSESS rating rating-unable

cell-na

Score Ring Calculation

The overall score ring in the executive summary uses an SVG circle with `stroke-dashoffset`:

```
circumference = 2 * π * 42 ≈ 264
offset = circumference * (1 - score / 100)
```

For example, a score of 96.4% → offset = 264 * (1 - 0.964) = 9.5

Set stroke colour based on rating: var(--rating-high) for HIGH, var(--rating-adequate) for ADEQUATE, etc.
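The worked example above, sketched in Python (the 96.4% score is illustrative):

```python
import math

circumference = 2 * math.pi * 42   # 263.89..., rounded to 264 in the template
score = 96.4                       # illustrative overall quality score
offset = 264 * (1 - score / 100)
print(round(offset, 1))  # 9.5
```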

Workflow Summary

Step 1: Discovery

Conduct the discovery interview to understand:

  • What the data is and where it comes from

  • What the user intends to do with it

  • What quality concerns exist

  • What documentation is available

Step 2: Data Profiling & Column Scorecard

Load and profile the data:

  • Inspect schema and data types

  • Calculate descriptive statistics

  • Analyse missing data patterns

  • Detect outliers and anomalies

  • Check for duplicates

  • Validate data types and formats

  • Compute the column-level quality scorecard (completeness, validity, consistency, uniqueness, timeliness, accuracy per column)

  • Calculate column summary scores and overall dataset quality score

Use Python (pandas, numpy) or the appropriate tool for the data format. Write analysis code in a script file or notebook for reproducibility.

Step 3: Dimension Assessment

Work through each of the seven dimensions systematically:

  • Combine data profiling results with contextual information from discovery

  • Assign a rating for each dimension

  • Document evidence, risks, and recommendations

Step 4: Report Generation

Generate the HTML report:

  • Copy the report template from references/report-template.html

  • Replace all {{PLACEHOLDER}} values with findings

  • Populate scorecard table rows with colour-coded CSS classes (`cell-high`, `cell-adequate`, `cell-low`)

  • Populate dimension cards with assessment, evidence, risks, and recommendations

  • Populate recommendation items grouped by severity

  • Calculate the score ring offset: 264 * (1 - score / 100)

  • Save as data-quality-report.html (or a user-specified filename)

Step 5: Review & Handover

  • Present the report to the user

  • Highlight critical findings and risks

  • Discuss recommendations and next steps

  • Clarify if the report will feed into a formal quality declaration

Technical Notes

Supported Data Formats

The skill should handle any data format the user provides:

  • Tabular: CSV, TSV, Excel (.xlsx, .xls), Parquet, Feather, ORC

  • Structured: JSON, JSONL, XML

  • Database: SQL query results, SQLite files

  • Other: Fixed-width files, SAS datasets (.sas7bdat), Stata (.dta), SPSS (.sav)

Python Libraries for Profiling

When writing profiling code, prefer these libraries:

```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("data.csv")  # or appropriate reader

# Schema
print(df.dtypes)
print(df.shape)

# Descriptive statistics
print(df.describe(include='all'))

# Missing data
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'missing_count': missing, 'missing_pct': missing_pct}))

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Value distributions for categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}:")
    print(df[col].value_counts().head(10))
```

Handling Large Datasets

For datasets too large to load fully into memory:

  • Use chunked reading (`pd.read_csv(..., chunksize=10000)`)

  • Profile a representative sample first

  • Note in the report that profiling was done on a sample and state the sample size
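The chunked approach can be sketched as follows; an in-memory `StringIO` stands in for a large CSV on disk, and only per-chunk aggregates are kept in memory:

```python
import io
import pandas as pd

# io.StringIO stands in for a large CSV file on disk
csv = io.StringIO("a,b\n1,\n2,5\n,6\n4,7\n")

# Accumulate null counts chunk by chunk; the full file never loads at once
totals, rows = None, 0
for chunk in pd.read_csv(csv, chunksize=2):
    rows += len(chunk)
    counts = chunk.isna().sum()
    totals = counts if totals is None else totals.add(counts, fill_value=0)

missing_pct = (totals / rows * 100).round(2)
print(missing_pct.to_dict())  # {'a': 25.0, 'b': 25.0}
```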

Content Guidelines

Australian English

Use Australian English spelling throughout:

  • analyse (not analyze)

  • organisation (not organization)

  • colour (not color)

  • programme (not program, unless referring to software)

  • licence (noun) / license (verb)

  • centre (not center)

Tone

  • Professional and objective

  • Evidence-based — every finding must be supported by data

  • Actionable — recommendations should be specific and practical

  • Proportionate — focus attention on the dimensions that matter most for the intended use

Ratings

  • Be honest about quality — do not inflate ratings

  • Use "UNABLE TO ASSESS" when genuine uncertainty exists rather than guessing

  • Always explain the rationale for each rating

  • Consider the intended use when assessing fitness for purpose

Framework Reference

Source: Australian Bureau of Statistics, ABS Data Quality Framework, Catalogue No. 1520.0, May 2009.

The ABS DQF was developed based on:

  • Statistics Canada Quality Assurance Framework

  • European Statistics Code of Practice

Key principle: Quality is fitness for purpose. The same dataset may be high quality for one use and low quality for another. Always assess quality relative to the intended use.

Trade-offs: There are inherent trade-offs between dimensions. For example:

  • Timeliness vs Accuracy — releasing data faster may reduce time for quality checks

  • Accuracy vs Accessibility — more detailed data may be harder to access due to confidentiality

  • Relevance vs Coherence — tailoring data to specific needs may reduce comparability

Checklist

Before delivering the report, verify:

  • Discovery interview completed (or assumptions clearly documented)

  • Data successfully loaded and profiled

  • Column-level quality scorecard computed with all six metrics

  • Column summary scores and overall dataset quality score calculated

  • All seven dimensions assessed with ratings

  • Evidence provided for each assessment

  • Risks identified for each dimension

  • Recommendations are specific and actionable

  • Rating summary table included in executive summary

  • Overall quality rating derived and justified

  • Australian English spelling used throughout

  • Report saved as standalone HTML file

  • Light/dark mode toggle works correctly

  • Scorecard cells are colour-coded (green/amber/red)

  • Score ring displays correct percentage and offset

  • Sidebar navigation links to all sections

  • Print layout renders cleanly (test with Ctrl+P)

  • Any analysis code saved for reproducibility
