data-classification

Data Classification Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-classification" with this command: npx skills add vivekgana/databricks-platform-marketplace/vivekgana-databricks-platform-marketplace-data-classification

Overview

A framework for identifying, tagging, and managing data according to its sensitivity, regulatory requirements, and business context.

Classification Framework

Sensitivity Levels

  • PUBLIC: Approved for public disclosure

  • INTERNAL: Internal business use only

  • CONFIDENTIAL: Sensitive business data

  • RESTRICTED: Highly sensitive regulated data (PII, PHI, PCI)
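
The four levels read as an ordered scale from least to most restrictive. The following is a minimal sketch under two assumptions that the source does not state: that the list order is the restrictiveness order, and that a table should inherit the most restrictive level found among its columns. `Sensitivity` and `table_sensitivity` are hypothetical names.

```python
from enum import IntEnum
from typing import Iterable


class Sensitivity(IntEnum):
    """Sensitivity levels, ordered from least to most restrictive (assumed order)."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def table_sensitivity(column_levels: Iterable[Sensitivity]) -> Sensitivity:
    """Return the most restrictive level among a table's columns.

    Falls back to INTERNAL for a table with no classified columns,
    mirroring the default used by the schema-based classifier.
    """
    return max(column_levels, default=Sensitivity.INTERNAL)


levels = [Sensitivity.INTERNAL, Sensitivity.CONFIDENTIAL, Sensitivity.RESTRICTED]
print(table_sensitivity(levels).name)
```

Using an `IntEnum` keeps comparisons such as `level >= Sensitivity.CONFIDENTIAL` readable when access rules depend on a threshold.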

Data Domains

  • CUSTOMER: Customer and prospect data

  • FINANCIAL: Financial and accounting data

  • EMPLOYEE: HR and employee information

  • PRODUCT: Product and service data

  • OPERATIONAL: System and operational data

PII Categories

  • DIRECT_IDENTIFIERS: Name, email, SSN, phone

  • QUASI_IDENTIFIERS: Zip code, DOB, gender

  • SENSITIVE_ATTRIBUTES: Health, race, religion, biometrics

  • FINANCIAL_DATA: Credit cards, bank accounts, salary
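
Taken together, the three taxonomies above describe one record per data asset. The sketch below shows one possible way to hold such a record and reject values outside the taxonomy. The `Classification` dataclass and its field names are hypothetical, and treating the PII category as optional is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values, copied from the lists above.
SENSITIVITY_LEVELS = {"PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"}
DATA_DOMAINS = {"CUSTOMER", "FINANCIAL", "EMPLOYEE", "PRODUCT", "OPERATIONAL"}
PII_CATEGORIES = {
    "DIRECT_IDENTIFIERS",
    "QUASI_IDENTIFIERS",
    "SENSITIVE_ATTRIBUTES",
    "FINANCIAL_DATA",
}


@dataclass(frozen=True)
class Classification:
    """One classification record (hypothetical structure)."""
    sensitivity: str
    domain: str
    pii_category: Optional[str] = None  # None: no PII category assigned

    def __post_init__(self):
        # Validate every field against the taxonomy.
        if self.sensitivity not in SENSITIVITY_LEVELS:
            raise ValueError(f"unknown sensitivity: {self.sensitivity!r}")
        if self.domain not in DATA_DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if self.pii_category is not None and self.pii_category not in PII_CATEGORIES:
            raise ValueError(f"unknown PII category: {self.pii_category!r}")


record = Classification("RESTRICTED", "CUSTOMER", "DIRECT_IDENTIFIERS")
print(record)
```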

Classification Methods

1. Schema-Based Classification

def classify_by_schema(column_name: str, data_type: str) -> dict:
    """Classify based on column name and type."""
    # Default when no rule matches. data_type is accepted but not used by any rule yet.
    classification = {"sensitivity": "INTERNAL"}

    col_lower = column_name.lower()

    # Case-insensitive substring matching; the first matching rule wins.
    if any(x in col_lower for x in ['ssn', 'social_security', 'tax_id']):
        classification = {"sensitivity": "RESTRICTED", "pii": "DIRECT_IDENTIFIERS"}
    elif any(x in col_lower for x in ['email', 'phone', 'address']):
        classification = {"sensitivity": "CONFIDENTIAL", "pii": "DIRECT_IDENTIFIERS"}
    elif any(x in col_lower for x in ['salary', 'credit_card', 'bank_account']):
        classification = {"sensitivity": "RESTRICTED", "pii": "FINANCIAL_DATA"}

    return classification
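
The if/elif chain above can also be expressed as an ordered rule table, which is easier to extend and to load from a file. This is a sketch of that variant, not the skill's own implementation: `SCHEMA_RULES` and `classify_by_rules` are hypothetical names, and the keywords are copied from the function above.

```python
from typing import Dict, List, Tuple

Rule = Tuple[Tuple[str, ...], Dict[str, str]]

# (keywords, classification) pairs, checked in order; the first match wins.
SCHEMA_RULES: List[Rule] = [
    (("ssn", "social_security", "tax_id"),
     {"sensitivity": "RESTRICTED", "pii": "DIRECT_IDENTIFIERS"}),
    (("email", "phone", "address"),
     {"sensitivity": "CONFIDENTIAL", "pii": "DIRECT_IDENTIFIERS"}),
    (("salary", "credit_card", "bank_account"),
     {"sensitivity": "RESTRICTED", "pii": "FINANCIAL_DATA"}),
]

DEFAULT_CLASSIFICATION = {"sensitivity": "INTERNAL"}


def classify_by_rules(column_name: str, rules: List[Rule] = SCHEMA_RULES) -> Dict[str, str]:
    """Return the classification of the first rule whose keyword occurs in the name."""
    col_lower = column_name.lower()
    for keywords, classification in rules:
        if any(keyword in col_lower for keyword in keywords):
            return dict(classification)  # copy so callers cannot mutate the rule
    return dict(DEFAULT_CLASSIFICATION)


for name in ("Customer_Email", "order_total", "ip_address"):
    print(name, classify_by_rules(name))
```

Because the match is a plain substring test, `ip_address` is classified the same way as a postal address. Whether that is acceptable depends on the data, so automated results of this kind are worth a human review.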

2. Content-Based Classification

import re

def classify_by_content(table: str, column: str, sample_rate: float = 0.01):
    """Sample data content for classification."""
    # Assumes an active SparkSession named `spark`, as in a Databricks notebook.
    samples = spark.table(table).sample(sample_rate).select(column).limit(100)

    pii_patterns = {
        'SSN': r'\b\d{3}-?\d{2}-?\d{4}\b',
        'EMAIL': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'CREDIT_CARD': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
        'PHONE': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    }

    detected_pii = []
    for row in samples.collect():
        if row[0] is None:
            continue  # skip NULLs instead of matching the string "None"
        value = str(row[0])
        for pii_type, pattern in pii_patterns.items():
            # re.search scans the whole value; re.match would only test its start.
            if re.search(pattern, value):
                detected_pii.append(pii_type)

    # De-duplicated list of detected PII type names, e.g. ['EMAIL', 'PHONE'].
    return list(set(detected_pii))
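
A single stray match in a sample can flag a whole column. One common refinement, which is not part of the function above, is to require that a minimum share of the sampled values match before reporting a PII type. The sketch below works on a plain list of values so it can be tried without Spark. `detect_pii` and `min_ratio` are hypothetical names, and the 0.5 default is an arbitrary illustration rather than a recommended threshold.

```python
import re
from collections import Counter
from typing import Iterable, List

# Same four patterns as above, precompiled.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}


def detect_pii(values: Iterable[object], min_ratio: float = 0.5) -> List[str]:
    """Return PII types found in at least `min_ratio` of the non-null values."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return []

    hits = Counter()
    for value in non_null:
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(value):  # match anywhere in the value
                hits[pii_type] += 1

    total = len(non_null)
    return sorted(t for t, count in hits.items() if count / total >= min_ratio)


print(detect_pii(["a@example.com", "b@example.org", None, "n/a"]))
print(detect_pii(["x", "y", "z", "a@b.io"]))
```

In the first call two of the three non-null values are email addresses, so `EMAIL` is reported. In the second call only one value in four matches, which falls below the threshold.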

3. Tag Application

-- Create classification tags
CREATE TAG governance.sensitivity;
CREATE TAG governance.data_domain;
CREATE TAG governance.pii_category;

-- Apply to catalog
ALTER CATALOG production SET TAGS ('governance.data_domain' = 'CUSTOMER');

-- Apply to table
ALTER TABLE production.customers.profiles SET TAGS (
    'governance.sensitivity' = 'RESTRICTED',
    'governance.pii_category' = 'DIRECT_IDENTIFIERS'
);

-- Apply to column
ALTER TABLE production.customers.profiles ALTER COLUMN email SET TAGS ('governance.sensitivity' = 'CONFIDENTIAL');
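
When classifications are produced by code, the table-level and column-level statements above can be generated rather than written by hand. This is a rough sketch that only builds the SQL strings; it does not execute them. `tag_statements` is a hypothetical helper, table and column names are assumed to be trusted identifiers, and tag text containing anything other than letters, digits, spaces, `_`, `.`, or `-` is rejected rather than escaped.

```python
import re
from typing import Dict, List

# Conservative whitelist for tag keys and values (an assumption of this sketch).
_SAFE_TAG_TEXT = re.compile(r"^[A-Za-z0-9_.\- ]+$")


def _tag_list(tags: Dict[str, str]) -> str:
    """Render {'k': 'v'} as "'k' = 'v'", refusing text that could break the quoting."""
    parts = []
    for key, value in tags.items():
        for text in (key, value):
            if not _SAFE_TAG_TEXT.match(text):
                raise ValueError(f"unsafe tag text: {text!r}")
        parts.append(f"'{key}' = '{value}'")
    return ", ".join(parts)


def tag_statements(
    table: str,
    table_tags: Dict[str, str],
    column_tags: Dict[str, Dict[str, str]],
) -> List[str]:
    """Build one ALTER TABLE statement for the table and one per tagged column."""
    statements = []
    if table_tags:
        statements.append(f"ALTER TABLE {table} SET TAGS ({_tag_list(table_tags)})")
    for column, tags in column_tags.items():
        if tags:
            statements.append(
                f"ALTER TABLE {table} ALTER COLUMN {column} SET TAGS ({_tag_list(tags)})"
            )
    return statements


for statement in tag_statements(
    "production.customers.profiles",
    {"governance.sensitivity": "RESTRICTED"},
    {"email": {"governance.sensitivity": "CONFIDENTIAL"}},
):
    print(statement)
```

Returning strings instead of running them keeps the step easy to review, for example by writing the statements to a file that a data steward approves before execution.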

Automated Classification

# load_rules, infer_table_classification, apply_table_tags, apply_column_tags and the
# list_schemas / list_tables / get_table_info helpers are not shown in this section.
class DataClassifier:
    def __init__(self):
        self.classification_rules = self.load_rules()

    def classify_catalog(self, catalog: str):
        """Auto-classify entire catalog."""
        schemas = list_schemas(catalog)
        for schema in schemas:
            tables = list_tables(catalog, schema)
            for table in tables:
                self.classify_table(f"{catalog}.{schema}.{table}")

    def classify_table(self, table_name: str):
        """Classify table and columns."""
        table_info = get_table_info(table_name)

        # Table-level classification
        table_class = self.infer_table_classification(table_info)
        self.apply_table_tags(table_name, table_class)

        # Column-level classification
        for column in table_info.columns:
            column_class = self.classify_column(table_name, column)
            self.apply_column_tags(table_name, column.name, column_class)

    def classify_column(self, table_name: str, column) -> dict:
        """Classify individual column."""
        # Schema-based
        schema_class = classify_by_schema(column.name, column.type)

        # Content-based fallback when the column name matched no schema rule
        if schema_class["sensitivity"] == "INTERNAL":
            detected = classify_by_content(table_name, column.name)
            if detected:
                # classify_by_content returns a list of PII type names, so wrap it in a dict.
                # Treating any detected PII as RESTRICTED is an assumption, not from the source.
                return {"sensitivity": "RESTRICTED", "detected_pii": detected}

        return schema_class
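
Before a classifier writes tags across a whole catalog, it can help to run the same traversal as a dry run that only collects proposed classifications. The sketch below does this over an in-memory description of a catalog. `plan_classification`, the nested-dict catalog layout, and the toy classifier are all hypothetical and stand in for the metadata helpers used above.

```python
from typing import Callable, Dict, List, Tuple

# schema name -> table name -> column names (hypothetical in-memory layout)
Catalog = Dict[str, Dict[str, List[str]]]
PlanEntry = Tuple[str, str, dict]


def plan_classification(
    catalog_name: str,
    catalog: Catalog,
    classify: Callable[[str], dict],
) -> List[PlanEntry]:
    """Walk catalog -> schema -> table -> column and collect proposals without applying them."""
    plan = []
    for schema, tables in catalog.items():
        for table, columns in tables.items():
            full_name = f"{catalog_name}.{schema}.{table}"
            for column in columns:
                plan.append((full_name, column, classify(column)))
    return plan


def toy_classifier(column_name: str) -> dict:
    """Stand-in for a real column classifier."""
    if "email" in column_name.lower():
        return {"sensitivity": "CONFIDENTIAL"}
    return {"sensitivity": "INTERNAL"}


catalog = {"customers": {"profiles": ["id", "email"]}}
for entry in plan_classification("production", catalog, toy_classifier):
    print(entry)
```

The resulting list can be reviewed or diffed against existing tags before anything is applied.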

Best Practices

  • Start with High-Value Data: Classify PII and regulated data first

  • Automate Where Possible: Use rules and ML for consistency

  • Human Validation: Review automated classifications

  • Document Rationale: Record why each classification decision was made

  • Regular Re-classification: Update when schemas change

  • Align with Policies: Link classification to access/retention policies
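
The re-classification practice above needs some way to notice that a schema has changed. One possible approach, offered as an assumption rather than something the skill prescribes, is to store a fingerprint of each table's column names and types and re-run classification when it differs. `schema_fingerprint` and `needs_reclassification` are hypothetical names.

```python
import hashlib
import json
from typing import Iterable, Tuple


def schema_fingerprint(columns: Iterable[Tuple[str, str]]) -> str:
    """Hash a table's (column name, data type) pairs, ignoring order and letter case."""
    normalized = sorted((name.lower(), dtype.lower()) for name, dtype in columns)
    payload = json.dumps(normalized).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_reclassification(stored_fingerprint: str, columns: Iterable[Tuple[str, str]]) -> bool:
    """True when the current schema no longer matches the fingerprint saved at classification time."""
    return stored_fingerprint != schema_fingerprint(columns)


original = [("id", "bigint"), ("email", "string")]
saved = schema_fingerprint(original)

# Reordering columns does not trigger re-classification; adding one does.
print(needs_reclassification(saved, [("email", "string"), ("id", "bigint")]))
print(needs_reclassification(saved, original + [("ssn", "string")]))
```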

Templates

  • classification-rules.yaml: Classification rule definitions

  • tag-taxonomy.sql: Tag schema and values

  • classification-workflow.py: Automated classification pipeline

Examples

  • pii-classification: PII detection and tagging

  • sensitivity-assignment: Sensitivity level classification

  • compliance-mapping: Map classifications to regulations
