discovering-data

Discover what data exists for a concept or domain. Answer "What data do we have about X?"

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "discovering-data" with this command: npx skills add astronomer/agents/astronomer-agents-discovering-data

Data Exploration

Discover what data exists for a concept or domain. Answer "What data do we have about X?"

Fast Table Validation

When you have multiple candidate tables, quickly validate before committing to complex queries.

Strategy: Progressive Complexity

Start with the simplest possible query, then add complexity only after each step succeeds:

Step 1: Does the data exist? → Simple LIMIT query, no JOINs Step 2: How much data? → COUNT(*) with same filters Step 3: What are the key IDs? → SELECT DISTINCT foreign_keys LIMIT 100 Step 4: Get related details → JOIN on the specific IDs from step 3

Never jump from step 1 to complex aggregations. If step 1 returns 50 rows, use those IDs directly:

-- After finding deployment_ids in step 1: SELECT o.org_name, d.deployment_name FROM DEPLOYMENTS d JOIN ORGANIZATIONS o ON d.org_id = o.org_id WHERE d.deployment_id IN ('id1', 'id2', 'id3') -- IDs from step 1

When a Metadata Table Returns 0 Results

If a smaller metadata/config table (like *_LOG , *_CONFIG ) returns 0 results, check the execution/fact table before concluding data doesn't exist.

Metadata tables may have gaps or lag. The actual execution data (in tables with millions/billions of rows) is often more complete.

Use Row Counts as a Signal

When list_tables returns row counts:

  • Millions+ rows → likely execution/fact data (actual events, transactions, runs)

  • Thousands of rows → likely metadata/config (what's configured, not what happened)

For questions like "who is using X" or "how many times did Y happen", prioritize high-row-count tables first - they contain actual activity data.

⚠️ CRITICAL: Tables with 1B+ rows require special handling

If you see a table with billions of rows (like 6B), you MUST:

  • Use simple queries only: SELECT col1, col2 FROM table WHERE filter LIMIT 100

  • NO JOINs, NO GROUP BY, NO aggregations on the first query

  • Only add complexity after the simple query succeeds

If your query times out, simplify it - don't give up. Remove JOINs, remove GROUP BY, add LIMIT.

Example: Finding Feature Usage

If looking for "customers using feature X" and you see:

  • FEATURE_CONFIG (50K rows) - likely config/metadata

  • USER_EVENTS (500M rows) - likely execution data

Try the larger table first with a quick validation:

SELECT COUNT(*) FROM USER_EVENTS WHERE feature ILIKE '%X%' AND event_ts >= DATEADD(day, -30, CURRENT_DATE)

If count > 0, proceed. If 0, try the config table.

Querying Large Tables (100M+ rows)

Pattern: Find examples first, aggregate later

For billion-row tables, even ILIKE with date filters can timeout. Use LIMIT on a simple query (no JOINs, no GROUP BY):

-- Step 1: Find examples (fast - stops after finding matches) -- NO JOINS, NO GROUP BY - just find rows SELECT col_a, col_b, foreign_key_id FROM huge_table WHERE col_a ILIKE '%term%' AND ts >= DATEADD(day, -30, CURRENT_DATE) LIMIT 100

-- Step 2: Use foreign keys from step 1 to get details SELECT o.name, o.details FROM other_table o WHERE o.id IN ('id1', 'id2', 'id3') -- IDs from step 1

CRITICAL: LIMIT only helps without GROUP BY

-- ❌ STILL SLOW: LIMIT with GROUP BY - must scan ALL rows first to compute groups SELECT col, COUNT(*) FROM huge_table WHERE x ILIKE '%term%' GROUP BY col LIMIT 100

-- ✅ FAST: LIMIT without GROUP BY - stops after finding 100 rows SELECT col, id FROM huge_table WHERE x ILIKE '%term%' LIMIT 100

Anti-patterns for large tables:

  • ❌ JOINs + GROUP BY + LIMIT (LIMIT doesn't help)

  • ❌ UPPER(col) LIKE '%TERM%' (use ILIKE instead)

  • ❌ Wide date ranges without LIMIT

Exploration Process

Step 1: Search for Relevant Tables

Search across all schemas for tables matching the concept:

SELECT TABLE_CATALOG as database, TABLE_SCHEMA as schema, TABLE_NAME as table_name, ROW_COUNT, COMMENT as description FROM <database>.INFORMATION_SCHEMA.TABLES WHERE LOWER(TABLE_NAME) LIKE '%<concept>%' OR LOWER(COMMENT) LIKE '%<concept>%' ORDER BY TABLE_SCHEMA, TABLE_NAME LIMIT 30

Also check for related terms:

  • Synonyms (e.g., "revenue" for "ARR", "client" for "customer")

  • Abbreviations (e.g., "arr" for "annual recurring revenue")

  • Related concepts (e.g., "orders" often relates to "customers")

Step 2: Categorize by Data Layer

Group discovered tables by their role in the data architecture:

Layer Naming Patterns Purpose

Raw/Staging raw_ , stg_ , staging_

Source data, minimal transformation

Intermediate int_ , base_ , prep_

Cleaned, joined, business logic applied

Marts/Facts fct_ , fact_ , mart_

Business metrics, analysis-ready

Dimensions dim_ , dimension_

Reference/lookup tables

Aggregates agg_ , summary_ , daily_

Pre-computed rollups

Reporting rpt_ , report_ , dashboard_

BI/reporting optimized

Step 3: Get Schema Details

For the most relevant tables (typically 2-5), use get_tables_info to retrieve:

  • Column names and types

  • Column descriptions

  • Key fields

Focus on tables that appear to be:

  • The "main" or canonical source

  • Most recent/actively maintained

  • Appropriate grain for analysis

Step 4: Understand Relationships

Identify how tables relate to each other:

-- Look for common key columns SELECT COLUMN_NAME, COUNT() as table_count FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME IN (<discovered_tables>) GROUP BY COLUMN_NAME HAVING COUNT() > 1 ORDER BY table_count DESC

Common relationship patterns:

  • customer_id joins customer tables

  • order_id joins order tables

  • date or event_date for time-series alignment

Step 5: Check Data Freshness

For key tables, verify they're actively maintained:

SELECT MAX(<timestamp_column>) as last_update, COUNT(*) as row_count FROM <table>

Flag tables that:

  • Haven't been updated recently

  • Have suspiciously low row counts

  • Might be deprecated

Step 6: Sample the Data

For the primary table(s), get sample rows to understand content:

SELECT * FROM <table> LIMIT 10

Output: Exploration Report

Summary

One paragraph explaining what data exists for this concept and which tables are most useful.

Discovered Tables

Table Schema Rows Last Updated Purpose

acct_product_arr

MART_FINANCE 42K Today Primary ARR by account/product

usage_arr_daily

METRICS 800K Today Daily usage-based ARR detail

arr_change_multi

METRICS 15K Today ARR movement tracking

Recommended Tables

For most analysis, use: MART_FINANCE.ACCT_PRODUCT_ARR

  • Monthly grain, account-level

  • Includes both contract and usage ARR

  • Has current month flag for easy filtering

For daily granularity: METRICS_FINANCE.USAGE_ARR_DAILY

  • Day-level detail

  • Usage/consumption based ARR only

For ARR movements: METRICS_FINANCE.ARR_CHANGE_MULTI

  • New, expansion, contraction, churn

  • Good for cohort analysis

Key Schema Details

For the primary table(s), show:

Column Type Description

acct_id

VARCHAR Account identifier

arr_amt

NUMBER Total ARR amount

eom_date

DATE End of month date

Relationships

[dim.customers] --< [fct.orders] --< [agg.daily_sales] | | +--< [fct.arr] ----+

Sample Queries

Provide 3-5 starter queries for common questions:

-- Total ARR by product SELECT product, SUM(arr_amt) as total_arr FROM mart_finance.acct_product_arr WHERE is_current_mth = TRUE GROUP BY product;

-- Top 10 customers SELECT parent_name, SUM(arr_amt) as arr FROM mart_finance.acct_product_arr WHERE is_current_mth = TRUE GROUP BY parent_name ORDER BY arr DESC LIMIT 10;

Next Steps

Suggest logical follow-ups:

  • "To deep-dive on a specific table, use the profiling-tables skill"

  • "To check data freshness, use the checking-freshness skill"

  • "To understand where this data comes from, use the tracing-upstream-lineage skill"

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

airflow

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

analyzing-data

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

authoring-dags

No summary provided by upstream source.

Repository SourceNeeds Review