BigQuery Query Optimization
Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.
Query Execution Analysis
Using EXPLAIN
-- Get execution plan
EXPLAIN SELECT * FROM project.dataset.table WHERE condition;
-- Get execution plan with runtime stats
EXPLAIN ANALYZE SELECT * FROM project.dataset.table WHERE condition;
What the plan shows:
-
Stages of execution
-
Bytes read per stage
-
Slot time consumed
-
Potential bottlenecks
Performance Best Practices
- Avoid SELECT *
❌ Bad (scans all columns):
SELECT * FROM project.dataset.large_table
✅ Good (only needed columns):
SELECT customer_id, amount, date
FROM project.dataset.large_table
Impact: Full table scan vs targeted column read. Can reduce data scanned by 90%+.
- Filter Early and Often
❌ Bad (filter after aggregation):
SELECT customer_id, SUM(amount) as total
FROM project.dataset.orders
GROUP BY customer_id
HAVING SUM(amount) > 1000
✅ Good (filter before aggregation):
SELECT customer_id, SUM(amount) as total
FROM project.dataset.orders
WHERE amount > 100 -- Filter early
GROUP BY customer_id
HAVING SUM(amount) > 1000
- Use Partitioned Tables
Without partition filter:
-- Scans entire table
SELECT * FROM project.dataset.orders
WHERE order_date >= '2024-01-01'
With partition filter:
-- Only scans relevant partitions
SELECT * FROM project.dataset.orders
WHERE DATE(order_timestamp) >= '2024-01-01' -- Partition column
Key: Filter on the partition column for automatic partition pruning.
- Break Complex Queries
❌ Anti-pattern (one huge query):
SELECT ... FROM ( SELECT ... FROM ( SELECT ... -- Deeply nested ) ) WHERE ...
✅ Good (use CTEs):
WITH base_data AS (
SELECT customer_id, amount, date
FROM project.dataset.orders
WHERE date >= '2024-01-01'
),
aggregated AS (
SELECT customer_id, SUM(amount) as total
FROM base_data
GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000
✅ Better (multi-statement with temp tables):
CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM project.dataset.orders
WHERE date >= '2024-01-01';
CREATE TEMP TABLE aggregated AS SELECT customer_id, SUM(amount) as total FROM base_data GROUP BY customer_id;
SELECT * FROM aggregated WHERE total > 1000;
- JOIN Optimization
Put largest table first:
-- ✅ Large table first
SELECT l.*, s.detail
FROM project.dataset.large_table l
JOIN project.dataset.small_table s
ON l.id = s.id
Use clustering on JOIN columns:
-
Cluster tables on frequently joined columns
-
BigQuery can prune data blocks more effectively
Consider ARRAY/STRUCT for 1:many:
-- Instead of JOIN for 1:many relationships
SELECT
order_id,
ARRAY_AGG(STRUCT(product_id, quantity, price)) as items
FROM project.dataset.order_items
GROUP BY order_id
- Leverage Automatic Features
BigQuery automatically performs:
-
Query rewrites - Optimizes query structure
-
Partition pruning - With proper filters
-
Dynamic filtering - Reduces data scanned
Ensure your queries enable these:
-
Filter on partition columns
-
Use simple, clear predicates
-
Avoid functions on partition columns in WHERE clause
Parameterized Queries
CLI Syntax
bq query
--use_legacy_sql=false
--parameter=start_date:DATE:2024-01-01
--parameter=end_date:DATE:2024-12-31
--parameter=min_amount:FLOAT64:100.0
'SELECT *
FROM project.dataset.orders
WHERE order_date BETWEEN @start_date AND @end_date
AND amount >= @min_amount'
Python Syntax
from google.cloud import bigquery
client = bigquery.Client()
query = """
SELECT customer_id, SUM(amount) as total
FROM project.dataset.orders
WHERE order_date >= @start_date
GROUP BY customer_id
"""
job_config = bigquery.QueryJobConfig( query_parameters=[ bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01") ] )
results = client.query_and_wait(query, job_config=job_config)
Key points:
-
Use @ prefix for named parameters
-
Syntax: name:TYPE:value or name::value (STRING default)
-
Cannot use as column/table names
-
Only works with standard SQL
User-Defined Functions (UDFs)
SQL UDF (Simple)
CREATE TEMP FUNCTION CleanEmail(email STRING) RETURNS STRING AS ( LOWER(TRIM(email)) );
SELECT CleanEmail(customer_email) as email
FROM project.dataset.customers;
JavaScript UDF (Complex Logic)
CREATE TEMP FUNCTION ParseUserAgent(ua STRING) RETURNS STRUCT<browser STRING, version STRING> LANGUAGE js AS r""" var match = ua.match(/(Chrome|Firefox|Safari)/(\d+)/); return { browser: match ? match[1] : 'Unknown', version: match ? match[2] : '0' }; """;
SELECT ParseUserAgent(user_agent).browser as browser
FROM project.dataset.sessions;
Limitations:
-
INT64 unsupported in JavaScript (use FLOAT64 or STRING)
-
JavaScript doesn't support 64-bit integers natively
Persistent UDFs
-- Create once, use many times
CREATE FUNCTION project.dataset.clean_email(email STRING)
RETURNS STRING
AS (LOWER(TRIM(email)));
-- Use anywhere
SELECT project.dataset.clean_email(email) FROM ...
Scripting & Procedural Language
Variables
DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM project.dataset.orders);
SELECT total_orders;
Loops
LOOP:
DECLARE x INT64 DEFAULT 0; LOOP SET x = x + 1; IF x >= 10 THEN LEAVE; END IF; END LOOP;
WHILE:
DECLARE x INT64 DEFAULT 0; WHILE x < 10 DO SET x = x + 1; END WHILE;
FOR with arrays:
DECLARE ids ARRAY<STRING>; SET ids = ['id1', 'id2', 'id3'];
FOR item IN (SELECT * FROM UNNEST(ids) as id) DO -- Process each id SELECT id; END FOR;
Query Caching
Automatic caching (24 hours):
-
Identical queries serve cached results (free)
-
No additional cost
-
Instant response
To bypass cache:
bq query --use_cache=false 'SELECT...'
Common Anti-Patterns
❌ Using LIMIT to reduce cost
-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM project.dataset.huge_table LIMIT 10
Impact: Still scans entire table. Use WHERE filters instead.
❌ Functions on partition columns
-- Prevents partition pruning WHERE CAST(date_column AS STRING) = '2024-01-01'
✅ Better:
WHERE date_column = DATE('2024-01-01')
❌ Cross joins without filters
-- Cartesian product = huge result SELECT * FROM table1 CROSS JOIN table2
Impact: Can generate millions/billions of rows.
❌ Correlated subqueries
-- Runs subquery for each row SELECT * FROM orders o WHERE amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id)
✅ Better (use window functions):
SELECT * FROM ( SELECT *, AVG(amount) OVER (PARTITION BY customer_id) as avg_amount FROM orders ) WHERE amount > avg_amount
Common SQL Gotchas
CTE Re-execution (CRITICAL COST ISSUE)
Problem: When a CTE is referenced multiple times, BigQuery re-executes it each time, billing you multiple times.
❌ Bad (CTE runs 3 times - billed 3x):
WITH expensive_cte AS (
SELECT * FROM project.dataset.huge_table
WHERE complex_conditions
AND lots_of_joins
)
SELECT COUNT(*) FROM expensive_cte
UNION ALL
SELECT SUM(amount) FROM expensive_cte
UNION ALL
SELECT MAX(date) FROM expensive_cte;
Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).
✅ Good (use temp table - billed 1x):
CREATE TEMP TABLE expensive_data AS
SELECT * FROM project.dataset.huge_table
WHERE complex_conditions
AND lots_of_joins;
SELECT COUNT(*) FROM expensive_data UNION ALL SELECT SUM(amount) FROM expensive_data UNION ALL SELECT MAX(date) FROM expensive_data;
When CTEs are OK:
-
Referenced only once
-
Very small result set
-
Part of larger query (BigQuery may optimize)
When to use temp tables:
-
CTE referenced 2+ times
-
Large data volumes
-
Complex/expensive CTE query
NOT IN with NULL Values (SILENT FAILURE)
Problem: NOT IN returns NOTHING (empty result) if ANY NULL exists in the subquery.
❌ Broken (returns empty if blocked_customers has any NULL):
SELECT * FROM customers WHERE customer_id NOT IN ( SELECT customer_id FROM blocked_customers -- If ANY NULL, returns 0 rows! );
Why it fails:
-
SQL three-valued logic: TRUE, FALSE, UNKNOWN
-
NULL IN (...) evaluates to UNKNOWN
-
NOT UNKNOWN is still UNKNOWN
-
Rows with UNKNOWN are filtered out
✅ Solution 1: Use NOT EXISTS (safest):
SELECT * FROM customers c WHERE NOT EXISTS ( SELECT 1 FROM blocked_customers b WHERE b.customer_id = c.customer_id );
✅ Solution 2: Filter NULLs explicitly:
SELECT * FROM customers WHERE customer_id NOT IN ( SELECT customer_id FROM blocked_customers WHERE customer_id IS NOT NULL -- Explicit NULL filter );
✅ Solution 3: Use LEFT JOIN:
SELECT c.* FROM customers c LEFT JOIN blocked_customers b ON c.customer_id = b.customer_id WHERE b.customer_id IS NULL;
Best practice: Prefer NOT EXISTS
- it's clearer, safer, and often faster.
DML Statement Performance
Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.
Why DML is slow in BigQuery:
-
Columnar storage (not row-based)
-
Designed for bulk reads, not individual updates
-
No indexes for fast row lookups
-
Every update rewrites affected partitions
❌ Very slow (row-by-row updates):
-- Don't do this - takes minutes/hours
UPDATE project.dataset.orders
SET status = 'processed'
WHERE order_id = '12345';
-- This is even worse - runs once per row FOR record IN (SELECT order_id FROM orders_to_update) DO UPDATE orders SET status = 'processed' WHERE order_id = record.order_id; END FOR;
⚠️ Better (batch updates):
UPDATE project.dataset.orders
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);
✅ Best (recreate table - fastest):
CREATE OR REPLACE TABLE project.dataset.orders AS
SELECT
- EXCEPT(status),
CASE
WHEN order_id IN (SELECT order_id FROM orders_to_update)
THEN 'processed'
ELSE status
END AS status
FROM
project.dataset.orders;
For INSERT/UPSERT - use MERGE:
MERGE project.dataset.customers AS target
USING project.dataset.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, created_at)
VALUES (customer_id, name, CURRENT_TIMESTAMP());
Best practices:
-
Batch updates instead of row-by-row
-
Use MERGE for upserts
-
Consider recreating table for large updates
-
Partition tables to limit update scope
-
Avoid frequent small DML operations
Data Type Gotchas
INT64 is the only integer type:
-- All of these are the same: INT64 CREATE TABLE example ( col1 INT64, -- ✅ Explicit col2 INTEGER, -- Converted to INT64 col3 INT, -- Converted to INT64 col4 SMALLINT, -- Converted to INT64 col5 BIGINT -- Converted to INT64 );
No UUID type - use STRING:
-- PostgreSQL CREATE TABLE users (id UUID);
-- BigQuery CREATE TABLE users (id STRING); -- Store UUID as string
NUMERIC precision limits:
-- NUMERIC: 38 digits precision, 9 decimal places NUMERIC(38, 9)
-- BIGNUMERIC: 76 digits precision, 38 decimal places BIGNUMERIC(76, 38)
-- Example SELECT CAST('12345678901234567890.123456789' AS NUMERIC) AS num, CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;
TIMESTAMP vs DATETIME vs DATE:
-- TIMESTAMP: UTC, timezone-aware SELECT CURRENT_TIMESTAMP(); -- 2024-01-15 10:30:45.123456 UTC
-- DATETIME: No timezone SELECT CURRENT_DATETIME(); -- 2024-01-15 10:30:45.123456
-- DATE: Date only SELECT CURRENT_DATE(); -- 2024-01-15
-- Conversion SELECT TIMESTAMP('2024-01-15 10:30:45'), -- Assumes UTC DATETIME(TIMESTAMP '2024-01-15 10:30:45'), -- Loses timezone DATE(TIMESTAMP '2024-01-15 10:30:45'); -- Loses time
Type coercion in JOINs:
-- ❌ Implicit cast can prevent optimization SELECT * FROM table1 t1 JOIN table2 t2 ON t1.id = CAST(t2.id AS STRING); -- Prevents clustering optimization
-- ✅ Match types explicitly SELECT * FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id; -- Both STRING
Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.
Basic Syntax
<function> OVER ( [PARTITION BY partition_expression] [ORDER BY sort_expression] [window_frame_clause] )
Ranking Functions
ROW_NUMBER() - Sequential numbering:
SELECT customer_id, order_date, amount, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num FROM orders;
Common use: Deduplication
SELECT * EXCEPT(row_num) FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num FROM customers ) WHERE row_num = 1;
RANK() - Rank with gaps:
SELECT product_name, revenue, RANK() OVER (ORDER BY revenue DESC) AS rank FROM products;
-- Results: -- Product A: $1000, rank 1 -- Product B: $900, rank 2 -- Product C: $900, rank 2 (tie) -- Product D: $800, rank 4 (gap after tie)
DENSE_RANK() - Rank without gaps:
SELECT product_name, revenue, DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank FROM products;
-- Results: -- Product A: $1000, rank 1 -- Product B: $900, rank 2 -- Product C: $900, rank 2 (tie) -- Product D: $800, rank 3 (no gap)
NTILE() - Divide into N buckets:
SELECT customer_id, total_spend, NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile FROM customer_totals;
Analytical Functions
LEAD() and LAG() - Access rows before/after:
-- Time series analysis SELECT date, revenue, LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue, LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue, revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change, ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date)) / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change FROM daily_sales ORDER BY date;
With PARTITION BY:
-- Per-customer analysis SELECT customer_id, order_date, amount, LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount, amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff FROM orders;
FIRST_VALUE() and LAST_VALUE():
SELECT date, revenue, FIRST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_day_revenue, LAST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_day_revenue FROM daily_sales;
NTH_VALUE() - Get Nth value:
SELECT product_id, date, sales, NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales FROM product_sales;
Aggregate Window Functions
SUM/AVG/COUNT as window functions:
SELECT date, revenue, SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue, AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7day, COUNT(*) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30day_count FROM daily_sales;
Running totals:
SELECT customer_id, order_date, amount, SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value FROM orders;
QUALIFY Clause (BigQuery-Specific)
QUALIFY filters on window function results - no subquery needed!
❌ Standard SQL (verbose):
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num FROM orders ) WHERE row_num = 1;
✅ BigQuery with QUALIFY (clean):
SELECT customer_id, order_date, amount FROM orders QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
More QUALIFY examples:
-- Get top 3 products per category SELECT category, product_name, revenue FROM products QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;
-- Filter outliers (keep middle 80%) SELECT * FROM measurements QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;
-- Get first order per customer SELECT customer_id, order_date, amount FROM orders QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;
Window Frame Clauses
Control which rows are included in the window:
-- ROWS: Physical row count ROWS BETWEEN 3 PRECEDING AND CURRENT ROW -- Last 4 rows including current
-- RANGE: Logical range (based on ORDER BY values) RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW -- Last 7 days
-- Examples: SELECT date, sales, -- Last 7 rows AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7rows,
-- Last 7 days (logical) AVG(sales) OVER (ORDER BY date RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW) AS avg_7days,
-- All preceding rows SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum,
-- Centered window (3 before, 3 after) AVG(sales) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS centered_avg FROM daily_sales;
Window Function Performance Tips
- Partition appropriately:
-- ✅ Good: Partitions reduce data scanned SELECT * FROM events WHERE date >= '2024-01-01' -- Partition filter QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;
- Avoid window functions in WHERE:
-- ❌ Wrong: Can't use window functions in WHERE SELECT * FROM orders WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1; -- ERROR
-- ✅ Use QUALIFY instead SELECT * FROM orders QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;
- Reuse window definitions:
SELECT date, revenue, ROW_NUMBER() OVER w AS row_num, RANK() OVER w AS rank, AVG(revenue) OVER w AS avg_revenue FROM sales WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);
JSON Functions
BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.
Extracting JSON Values
JSON_VALUE() - Extract scalar values (Standard SQL):
SELECT JSON_VALUE(json_column, '$.user.name') AS user_name, JSON_VALUE(json_column, '$.user.email') AS email, CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount, CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity FROM events;
JSON_QUERY() - Extract objects or arrays:
SELECT JSON_QUERY(json_column, '$.user') AS user_object, JSON_QUERY(json_column, '$.items') AS items_array FROM events;
JSON_EXTRACT() - Legacy, still widely used:
SELECT JSON_EXTRACT(json_column, '$.user.name') AS user_name, JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email -- Returns STRING FROM events;
JSONPath syntax:
-- Dot notation '$.user.name'
-- Array index '$.items[0].product_id'
-- Array slice '$.items[0:3]'
-- Wildcard '$.users[*].name'
-- Recursive descent '$..name' -- All 'name' fields at any level
Working with JSON Arrays
JSON_EXTRACT_ARRAY() - Extract array elements:
SELECT event_id, tag FROM events, UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;
JSON_VALUE_ARRAY() - Extract array of scalars:
SELECT product_id, tag FROM products, UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;
Complete example:
-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]} SELECT JSON_VALUE(product_json, '$.product_id') AS product_id, tag FROM products, UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag;
-- Results: -- A123, electronics -- A123, sale -- A123, featured
Creating JSON
TO_JSON_STRING() - Convert to JSON:
SELECT customer_id, TO_JSON_STRING(STRUCT( name, email, created_at )) AS customer_json FROM customers;
Create JSON objects:
SELECT TO_JSON_STRING(STRUCT( 'John' AS name, 30 AS age, ['reading', 'hiking'] AS hobbies, STRUCT('123 Main St' AS street, 'Boston' AS city) AS address )) AS person_json;
-- Result: -- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
Aggregate into JSON:
SELECT customer_id, TO_JSON_STRING( ARRAY_AGG( STRUCT(order_id, amount, date) ORDER BY date DESC LIMIT 5 ) ) AS recent_orders_json FROM orders GROUP BY customer_id;
Parsing JSON Strings
PARSE_JSON() - Convert string to JSON:
SELECT JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;
Safe JSON parsing (avoid errors):
SELECT SAFE.JSON_VALUE(invalid_json, '$.name') AS name -- Returns NULL on error FROM events;
Complex JSON Examples
Nested JSON extraction:
-- JSON structure: -- { -- "order": { -- "id": "ORD123", -- "items": [ -- {"product": "A", "qty": 2, "price": 10.50}, -- {"product": "B", "qty": 1, "price": 25.00} -- ] -- } -- }
SELECT JSON_VALUE(data, '$.order.id') AS order_id, JSON_VALUE(item, '$.product') AS product, CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity, CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price FROM events, UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;
Transform relational data to JSON:
SELECT category, TO_JSON_STRING( STRUCT( category AS category_name, COUNT(*) AS product_count, ROUND(AVG(price), 2) AS avg_price, ARRAY_AGG( STRUCT(product_name, price) ORDER BY price DESC LIMIT 3 ) AS top_products ) ) AS category_summary FROM products GROUP BY category;
JSON Performance Tips
- Extract once, reuse:
-- ❌ Bad: Multiple extractions SELECT JSON_VALUE(data, '$.user.id'), JSON_VALUE(data, '$.user.name'), JSON_VALUE(data, '$.user.email') FROM events;
-- ✅ Better: Extract object once WITH extracted AS ( SELECT JSON_QUERY(data, '$.user') AS user_json FROM events ) SELECT JSON_VALUE(user_json, '$.id'), JSON_VALUE(user_json, '$.name'), JSON_VALUE(user_json, '$.email') FROM extracted;
- Consider STRUCT columns instead of JSON:
-- If schema is known and stable, use STRUCT CREATE TABLE events ( user STRUCT<id STRING, name STRING, email STRING>, timestamp TIMESTAMP );
-- Query with dot notation (faster than JSON extraction) SELECT user.id, user.name, user.email FROM events;
- Materialize frequently accessed JSON fields:
-- Add extracted columns to table ALTER TABLE events ADD COLUMN user_id STRING AS (JSON_VALUE(data, '$.user.id'));
-- Now queries can filter efficiently SELECT * FROM events WHERE user_id = 'U123';
ARRAY and STRUCT
BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.
ARRAY Basics
Creating arrays:
SELECT [1, 2, 3, 4, 5] AS numbers, ['apple', 'banana', 'cherry'] AS fruits, [DATE '2024-01-01', DATE '2024-01-02'] AS dates;
ARRAY_AGG() - Aggregate into array:
SELECT customer_id, ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids, ARRAY_AGG(amount) AS order_amounts FROM orders GROUP BY customer_id;
With LIMIT:
SELECT customer_id, ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids FROM orders GROUP BY customer_id;
UNNEST - Flattening Arrays
Basic UNNEST:
SELECT element FROM UNNEST(['a', 'b', 'c']) AS element;
-- Results: -- a -- b -- c
UNNEST with table:
-- Table: customers -- customer_id | order_ids -- 1 | [101, 102, 103] -- 2 | [201, 202]
SELECT customer_id, order_id FROM customers, UNNEST(order_ids) AS order_id;
-- Results: -- 1, 101 -- 1, 102 -- 1, 103 -- 2, 201 -- 2, 202
UNNEST with OFFSET (get array index):
SELECT item, idx FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;
-- Results: -- first, 0 -- second, 1 -- third, 2
STRUCT - Nested Records
Creating structs:
SELECT STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person, STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;
Querying struct fields:
SELECT person.name, person.age, address.city FROM ( SELECT STRUCT('John' AS name, 30 AS age) AS person, STRUCT('Boston' AS city) AS address );
ARRAY of STRUCT (Most Powerful Pattern)
Create:
SELECT customer_id, ARRAY_AGG( STRUCT( order_id, amount, order_date, status ) ORDER BY order_date DESC ) AS orders FROM orders GROUP BY customer_id;
Query:
-- Flatten array of struct SELECT customer_id, order.order_id, order.amount, order.order_date FROM customers, UNNEST(orders) AS order WHERE order.status = 'completed';
Filter array elements:
SELECT customer_id, ARRAY( SELECT AS STRUCT order_id, amount FROM UNNEST(orders) AS order WHERE order.status = 'completed' ORDER BY amount DESC LIMIT 3 ) AS top_completed_orders FROM customers;
ARRAY Functions
ARRAY_LENGTH():
SELECT customer_id, ARRAY_LENGTH(order_ids) AS total_orders FROM customers;
ARRAY_CONCAT():
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined; -- Result: [1, 2, 3, 4, 5]
ARRAY_TO_STRING():
SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits; -- Result: 'apple, banana, cherry'
GENERATE_ARRAY():
SELECT GENERATE_ARRAY(1, 10) AS numbers; -- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10; -- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ARRAY_REVERSE():
SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed; -- Result: [5, 4, 3, 2, 1]
Performance: ARRAY vs JOIN
Traditional approach (2 tables, JOIN):
-- Table 1: customers (1M rows) -- Table 2: orders (10M rows)
SELECT c.customer_id, c.name, o.order_id, o.amount FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE c.customer_id = '12345';
-- Scans: customers (1M) + orders (10M) = 11M rows -- Join cost: High
Array approach (1 table with ARRAY):
-- Table: customers (1M rows with nested orders array)
SELECT customer_id, name, order.order_id, order.amount FROM customers, UNNEST(orders) AS order WHERE customer_id = '12345';
-- Scans: customers (1M) only -- No join cost -- 50-80% faster for 1:many relationships
When to use ARRAY:
-
1:many relationships (orders per customer)
-
Moderate array size (< 1000 elements)
-
Frequent filtering by parent entity
-
Want to reduce JOINs
When NOT to use ARRAY:
-
Many:many relationships
-
Very large arrays (> 10,000 elements)
-
Need to query array elements independently
-
Array elements frequently updated
Complete Example: Denormalized Design
Traditional normalized:
-- 3 tables, 2 JOINs SELECT c.customer_id, c.name, o.order_id, oi.product_id, oi.quantity FROM customers c JOIN orders o ON c.customer_id = o.customer_id JOIN order_items oi ON o.order_id = oi.order_id;
Denormalized with ARRAY/STRUCT:
-- 1 table, no JOINs CREATE TABLE customers_denormalized AS SELECT c.customer_id, c.name, ARRAY_AGG( STRUCT( o.order_id, o.order_date, o.status, ARRAY( SELECT AS STRUCT product_id, quantity, price FROM order_items WHERE order_id = o.order_id ) AS items ) ORDER BY o.order_date DESC ) AS orders FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.name;
-- Query (no JOINs!) SELECT customer_id, name, order.order_id, item.product_id, item.quantity FROM customers_denormalized, UNNEST(orders) AS order, UNNEST(order.items) AS item WHERE customer_id = '12345';
Performance improvement: 3-5x faster for typical queries.
BigQuery-Specific Features
EXCEPT and REPLACE in SELECT
EXCEPT - Exclude columns:
-- Select all except sensitive columns SELECT * EXCEPT(ssn, password, credit_card) FROM customers;
-- Combine with WHERE SELECT * EXCEPT(internal_notes) FROM orders WHERE status = 'shipped';
REPLACE - Modify columns:
-- Replace column values SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price) FROM products;
-- Anonymize data SELECT * REPLACE('' AS ssn, '' AS credit_card) FROM customers;
Combine EXCEPT and REPLACE:
SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email) FROM users;
SAFE Functions (NULL Instead of Errors)
SAFE_CAST() - Returns NULL on error:
-- Regular CAST throws error on invalid input SELECT CAST('invalid' AS INT64); -- ERROR
-- SAFE_CAST returns NULL SELECT SAFE_CAST('invalid' AS INT64) AS result; -- NULL
SAFE_DIVIDE() - Returns NULL on division by zero:
SELECT revenue, orders, SAFE_DIVIDE(revenue, orders) AS avg_order_value -- NULL if orders = 0 FROM daily_metrics;
Other SAFE functions:
-- SAFE_SUBTRACT (for dates) SELECT SAFE_SUBTRACT(DATE '2024-01-01', DATE '2024-12-31'); -- NULL (negative)
-- SAFE_NEGATE SELECT SAFE_NEGATE(9223372036854775807); -- NULL (overflow)
-- SAFE_ADD SELECT SAFE_ADD(9223372036854775807, 1); -- NULL (overflow)
Use case: Data quality checks:
SELECT COUNT() AS total_rows, COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts, COUNT() - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts FROM transactions;
GROUP BY with Column Numbers
-- ✅ Valid: Group by column position SELECT customer_id, DATE(order_date) AS order_date, SUM(amount) AS total FROM orders GROUP BY 1, 2; -- Same as: GROUP BY customer_id, DATE(order_date)
-- ✅ Also valid: Mix names and numbers SELECT customer_id, DATE(order_date) AS order_date, SUM(amount) AS total FROM orders GROUP BY customer_id, 2;
When useful:
-
Long expressions in SELECT
-
Complex CASE statements
-
Simplifies GROUP BY clause
TABLESAMPLE - Random Sampling
System sampling (fast, approximate):
-- Sample ~10% of data (by blocks) SELECT * FROM large_table TABLESAMPLE SYSTEM (10 PERCENT);
Use cases:
-
Quick data exploration
-
Testing queries on large tables
-
Statistical sampling
Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
PIVOT and UNPIVOT
PIVOT - Columns to rows:
SELECT * FROM ( SELECT product, quarter, sales FROM quarterly_sales ) PIVOT ( SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4') );
-- Before: -- product | quarter | sales -- A | Q1 | 100 -- A | Q2 | 150
-- After: -- product | Q1 | Q2 | Q3 | Q4 -- A | 100 | 150 | ... | ...
UNPIVOT - Rows to columns:
SELECT * FROM quarterly_totals UNPIVOT ( sales FOR quarter IN (Q1, Q2, Q3, Q4) );
-- Before: -- product | Q1 | Q2 | Q3 | Q4 -- A | 100 | 150 | 200 | 250
-- After: -- product | quarter | sales -- A | Q1 | 100 -- A | Q2 | 150
Standard SQL vs Legacy SQL
BigQuery has two SQL dialects:
Feature Standard SQL Legacy SQL
ANSI Compliance ✅ Yes ❌ No
Recommended ✅ Yes ❌ Deprecated
Table Reference
project.dataset.table
[project:dataset.table]
CTEs (WITH) ✅ Yes ❌ No
Window Functions ✅ Full support ⚠️ Limited
ARRAY/STRUCT ✅ Native ⚠️ Limited
Portability ✅ High ❌ BigQuery-only
How to detect Legacy SQL:
-- Legacy SQL indicators: -- 1. Square brackets for tables SELECT * FROM [project:dataset.table]
-- 2. GROUP EACH BY SELECT customer_id, COUNT(*) FROM orders GROUP EACH BY customer_id
-- 3. FLATTEN SELECT * FROM FLATTEN([project:dataset.table], field)
-- 4. TABLE_DATE_RANGE SELECT * FROM TABLE_DATE_RANGE([dataset.table_], TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))
Standard SQL equivalent:
-- 1. Backticks
SELECT * FROM project.dataset.table
-- 2. Regular GROUP BY SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id
-- 3. UNNEST
SELECT * FROM project.dataset.table, UNNEST(field)
-- 4. Partitioned table filter
SELECT * FROM project.dataset.table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-12-31')
Migration:
Set default to Standard SQL
bq query --use_legacy_sql=false 'SELECT ...'
Or in Python
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
Performance Checklist
Before running expensive queries:
-
☐ Use --dry_run to estimate cost
-
☐ Check if partition pruning is active
-
☐ Verify clustering on JOIN/WHERE columns
-
☐ Remove SELECT *
-
☐ Filter early with WHERE
-
☐ Use EXPLAIN to analyze plan
-
☐ Consider materialized views for repeated queries
-
☐ Test with LIMIT first on full query
-
☐ Avoid CTE re-execution (use temp tables if referenced 2+ times)
-
☐ Use NOT EXISTS instead of NOT IN
-
☐ Batch DML operations (avoid row-by-row updates)
-
☐ Consider ARRAY/STRUCT for 1:many relationships
Monitoring Query Performance
-- Check query statistics
SELECT
job_id,
user_email,
total_bytes_processed,
total_slot_ms,
TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_sec
FROM region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
Quick Wins
Immediate improvements:
-
Add partition filter → 50-90% cost reduction
-
Remove SELECT * → 30-70% cost reduction
-
Use clustering → 20-50% performance improvement
-
Break complex queries → 2-5x faster execution
-
Enable query cache → Free repeated queries