SQL Queries Skill
Write correct, performant, readable SQL across all major data warehouse dialects.
Dialect-Specific Reference
PostgreSQL (including Aurora, RDS, Supabase, Neon)
Date/time:
-- Current date/time CURRENT_DATE, CURRENT_TIMESTAMP, NOW()
-- Date arithmetic date_column + INTERVAL '7 days' date_column - INTERVAL '1 month'
-- Truncate to period DATE_TRUNC('month', created_at)
-- Extract parts EXTRACT(YEAR FROM created_at) EXTRACT(DOW FROM created_at) -- 0=Sunday
-- Format TO_CHAR(created_at, 'YYYY-MM-DD')
String functions:
-- Concatenation first_name || ' ' || last_name CONCAT(first_name, ' ', last_name)
-- Pattern matching column ILIKE '%pattern%' -- case-insensitive column ~ '^regex_pattern$' -- regex
-- String manipulation LEFT(str, n), RIGHT(str, n) SPLIT_PART(str, delimiter, position) REGEXP_REPLACE(str, pattern, replacement)
Arrays and JSON:
-- JSON access data->>'key' -- text data->'nested'->'key' -- json data#>>'{path,to,key}' -- nested text
-- Array operations ARRAY_AGG(column) ANY(array_column) array_column @> ARRAY['value']
Performance tips:
-
Use EXPLAIN ANALYZE to profile queries
-
Create indexes on frequently filtered/joined columns
-
Use EXISTS over IN for correlated subqueries
-
Partial indexes for common filter conditions
-
Use connection pooling for concurrent access
Snowflake
Date/time:
-- Current date/time CURRENT_DATE(), CURRENT_TIMESTAMP(), SYSDATE()
-- Date arithmetic DATEADD(day, 7, date_column) DATEDIFF(day, start_date, end_date)
-- Truncate to period DATE_TRUNC('month', created_at)
-- Extract parts YEAR(created_at), MONTH(created_at), DAY(created_at) DAYOFWEEK(created_at)
-- Format TO_CHAR(created_at, 'YYYY-MM-DD')
String functions:
-- Case-insensitive by default (depends on collation) column ILIKE '%pattern%' REGEXP_LIKE(column, 'pattern')
-- Parse JSON column:key::string -- dot notation for VARIANT PARSE_JSON('{"key": "value"}') GET_PATH(variant_col, 'path.to.key')
-- Flatten arrays/objects SELECT f.value FROM table, LATERAL FLATTEN(input => array_col) f
Semi-structured data:
-- VARIANT type access data:customer:name::STRING data:items[0]:price::NUMBER
-- Flatten nested structures SELECT t.id, item.value:name::STRING as item_name, item.value:qty::NUMBER as quantity FROM my_table t, LATERAL FLATTEN(input => t.data:items) item
Performance tips:
-
Use clustering keys on large tables (not traditional indexes)
-
Filter on clustering key columns for partition pruning
-
Set appropriate warehouse size for query complexity
-
Use RESULT_SCAN(LAST_QUERY_ID()) to avoid re-running expensive queries
-
Use transient tables for staging/temp data
BigQuery (Google Cloud)
Date/time:
-- Current date/time CURRENT_DATE(), CURRENT_TIMESTAMP()
-- Date arithmetic DATE_ADD(date_column, INTERVAL 7 DAY) DATE_SUB(date_column, INTERVAL 1 MONTH) DATE_DIFF(end_date, start_date, DAY) TIMESTAMP_DIFF(end_ts, start_ts, HOUR)
-- Truncate to period DATE_TRUNC(created_at, MONTH) TIMESTAMP_TRUNC(created_at, HOUR)
-- Extract parts EXTRACT(YEAR FROM created_at) EXTRACT(DAYOFWEEK FROM created_at) -- 1=Sunday
-- Format FORMAT_DATE('%Y-%m-%d', date_column) FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', ts_column)
String functions:
-- No ILIKE, use LOWER() LOWER(column) LIKE '%pattern%' REGEXP_CONTAINS(column, r'pattern') REGEXP_EXTRACT(column, r'pattern')
-- String manipulation SPLIT(str, delimiter) -- returns ARRAY ARRAY_TO_STRING(array, delimiter)
Arrays and structs:
-- Array operations ARRAY_AGG(column) UNNEST(array_column) ARRAY_LENGTH(array_column) value IN UNNEST(array_column)
-- Struct access struct_column.field_name
Performance tips:
-
Always filter on partition columns (usually date) to reduce bytes scanned
-
Use clustering for frequently filtered columns within partitions
-
Use APPROX_COUNT_DISTINCT() for large-scale cardinality estimates
-
Avoid SELECT * -- billing is per-byte scanned
-
Use DECLARE and SET for parameterized scripts
-
Preview query cost with dry run before executing large queries
Redshift (Amazon)
Date/time:
-- Current date/time CURRENT_DATE, GETDATE(), SYSDATE
-- Date arithmetic DATEADD(day, 7, date_column) DATEDIFF(day, start_date, end_date)
-- Truncate to period DATE_TRUNC('month', created_at)
-- Extract parts EXTRACT(YEAR FROM created_at) DATE_PART('dow', created_at)
String functions:
-- Case-insensitive column ILIKE '%pattern%' REGEXP_INSTR(column, 'pattern') > 0
-- String manipulation SPLIT_PART(str, delimiter, position) LISTAGG(column, ', ') WITHIN GROUP (ORDER BY column)
Performance tips:
-
Design distribution keys for collocated joins (DISTKEY)
-
Use sort keys for frequently filtered columns (SORTKEY)
-
Use EXPLAIN to check query plan
-
Avoid cross-node data movement (watch for DS_BCAST and DS_DIST)
-
ANALYZE and VACUUM regularly
-
Use late-binding views for schema flexibility
Databricks SQL
Date/time:
-- Current date/time CURRENT_DATE(), CURRENT_TIMESTAMP()
-- Date arithmetic DATE_ADD(date_column, 7) DATEDIFF(end_date, start_date) ADD_MONTHS(date_column, 1)
-- Truncate to period DATE_TRUNC('MONTH', created_at) TRUNC(date_column, 'MM')
-- Extract parts YEAR(created_at), MONTH(created_at) DAYOFWEEK(created_at)
Delta Lake features:
-- Time travel SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15' SELECT * FROM my_table VERSION AS OF 42
-- Describe history DESCRIBE HISTORY my_table
-- Merge (upsert) MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
Performance tips:
-
Use Delta Lake's OPTIMIZE and ZORDER for query performance
-
Leverage Photon engine for compute-intensive queries
-
Use CACHE TABLE for frequently accessed datasets
-
Partition by low-cardinality date columns
Common SQL Patterns
Window Functions
-- Ranking ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) RANK() OVER (PARTITION BY category ORDER BY revenue DESC) DENSE_RANK() OVER (ORDER BY score DESC)
-- Running totals / moving averages SUM(revenue) OVER (ORDER BY date_col ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total AVG(revenue) OVER (ORDER BY date_col ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg_7d
-- Lag / Lead LAG(value, 1) OVER (PARTITION BY entity ORDER BY date_col) as prev_value LEAD(value, 1) OVER (PARTITION BY entity ORDER BY date_col) as next_value
-- First / Last value FIRST_VALUE(status) OVER (PARTITION BY user_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) LAST_VALUE(status) OVER (PARTITION BY user_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-- Percent of total revenue / SUM(revenue) OVER () as pct_of_total revenue / SUM(revenue) OVER (PARTITION BY category) as pct_of_category
CTEs for Readability
WITH -- Step 1: Define the base population base_users AS ( SELECT user_id, created_at, plan_type FROM users WHERE created_at >= DATE '2024-01-01' AND status = 'active' ),
-- Step 2: Calculate user-level metrics user_metrics AS ( SELECT u.user_id, u.plan_type, COUNT(DISTINCT e.session_id) as session_count, SUM(e.revenue) as total_revenue FROM base_users u LEFT JOIN events e ON u.user_id = e.user_id GROUP BY u.user_id, u.plan_type ),
-- Step 3: Aggregate to summary level summary AS ( SELECT plan_type, COUNT(*) as user_count, AVG(session_count) as avg_sessions, SUM(total_revenue) as total_revenue FROM user_metrics GROUP BY plan_type )
SELECT * FROM summary ORDER BY total_revenue DESC;
Cohort Retention
WITH cohorts AS ( SELECT user_id, DATE_TRUNC('month', first_activity_date) as cohort_month FROM users ), activity AS ( SELECT user_id, DATE_TRUNC('month', activity_date) as activity_month FROM user_activity ) SELECT c.cohort_month, COUNT(DISTINCT c.user_id) as cohort_size, COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month THEN a.user_id END) as month_0, COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '1 month' THEN a.user_id END) as month_1, COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '3 months' THEN a.user_id END) as month_3 FROM cohorts c LEFT JOIN activity a ON c.user_id = a.user_id GROUP BY c.cohort_month ORDER BY c.cohort_month;
Funnel Analysis
WITH funnel AS ( SELECT user_id, MAX(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) as step_1_view, MAX(CASE WHEN event = 'signup_start' THEN 1 ELSE 0 END) as step_2_start, MAX(CASE WHEN event = 'signup_complete' THEN 1 ELSE 0 END) as step_3_complete, MAX(CASE WHEN event = 'first_purchase' THEN 1 ELSE 0 END) as step_4_purchase FROM events WHERE event_date >= CURRENT_DATE - INTERVAL '30 days' GROUP BY user_id ) SELECT COUNT(*) as total_users, SUM(step_1_view) as viewed, SUM(step_2_start) as started_signup, SUM(step_3_complete) as completed_signup, SUM(step_4_purchase) as purchased, ROUND(100.0 * SUM(step_2_start) / NULLIF(SUM(step_1_view), 0), 1) as view_to_start_pct, ROUND(100.0 * SUM(step_3_complete) / NULLIF(SUM(step_2_start), 0), 1) as start_to_complete_pct, ROUND(100.0 * SUM(step_4_purchase) / NULLIF(SUM(step_3_complete), 0), 1) as complete_to_purchase_pct FROM funnel;
Deduplication
-- Keep the most recent record per key WITH ranked AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY entity_id ORDER BY updated_at DESC ) as rn FROM source_table ) SELECT * FROM ranked WHERE rn = 1;
Error Handling and Debugging
When a query fails:
-
Syntax errors: Check for dialect-specific syntax (e.g., ILIKE not available in BigQuery, SAFE_DIVIDE only in BigQuery)
-
Column not found: Verify column names against schema -- check for typos, case sensitivity (PostgreSQL is case-sensitive for quoted identifiers)
-
Type mismatches: Cast explicitly when comparing different types (CAST(col AS DATE) , col::DATE )
-
Division by zero: Use NULLIF(denominator, 0) or dialect-specific safe division
-
Ambiguous columns: Always qualify column names with table alias in JOINs
-
Group by errors: All non-aggregated columns must be in GROUP BY (except in BigQuery which allows grouping by alias)