BigQuery Table Management
Use this skill when creating, modifying, or optimizing BigQuery table structures with partitioning and clustering.
Creating Tables
Basic Table Creation
Using bq mk:
bq mk -t \
  --schema 'customer_id:STRING,amount:FLOAT,date:DATE' \
  --description "Customer orders table" \
  project:dataset.orders
Using SQL DDL:
CREATE TABLE project.dataset.orders (
customer_id STRING,
amount FLOAT64,
date DATE,
created_at TIMESTAMP
);
Partitioning Strategies
Time-Based Partitioning
Create time-partitioned table:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,amount:FLOAT' \
  --time_partitioning_field timestamp \
  --time_partitioning_type DAY \
  project:dataset.orders
SQL DDL version:
CREATE TABLE project.dataset.orders (
timestamp TIMESTAMP,
customer_id STRING,
amount FLOAT64
)
PARTITION BY DATE(timestamp);
Partitioning options:
- DAY - Daily partitions (most common)
- HOUR - Hourly partitions (for high-volume data)
- MONTH - Monthly partitions (for historical data)
- YEAR - Yearly partitions (for very old data)
Ingestion-Time Partitioning
Create table with automatic _PARTITIONTIME:
bq mk -t \
  --schema 'customer_id:STRING,amount:FLOAT' \
  --time_partitioning_type DAY \
  project:dataset.orders
Query with ingestion-time partition:
SELECT * FROM project.dataset.orders
WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01');
Range Partitioning
Create range-partitioned table:
bq mk -t \
  --schema 'customer_id:INTEGER,region:STRING,sales:FLOAT' \
  --range_partitioning=customer_id,0,100,10 \
  project:dataset.sales
Parameters: field,start,end,interval
- Creates partitions: [0,10), [10,20), [20,30), ..., [90,100)
- Values outside [0,100) land in the __UNPARTITIONED__ partition
SQL DDL version:
CREATE TABLE project.dataset.sales (
customer_id INT64,
region STRING,
sales FLOAT64
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));
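The bucket layout above can be sketched in plain Python. This is an illustrative helper (not part of bq or BigQuery) mirroring how RANGE_BUCKET with GENERATE_ARRAY(0, 100, 10) assigns rows to partitions:

```python
def range_partition_lower_bound(value, start=0, end=100, interval=10):
    """Return the lower bound of the range partition a value falls into,
    or None for out-of-range values, which BigQuery routes to the
    __UNPARTITIONED__ partition."""
    if value < start or value >= end:
        return None
    return start + ((value - start) // interval) * interval

print(range_partition_lower_bound(37))   # 30 -> the [30, 40) partition
print(range_partition_lower_bound(150))  # None -> __UNPARTITIONED__
```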
Clustering
Basic Clustering
Create clustered table:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,product_id:STRING,amount:FLOAT' \
  --clustering_fields customer_id,product_id \
  project:dataset.orders
SQL DDL version:
CREATE TABLE project.dataset.orders (
timestamp TIMESTAMP,
customer_id STRING,
product_id STRING,
amount FLOAT64
)
CLUSTER BY customer_id, product_id;
Clustering rules:
- Up to 4 clustering columns
- Order matters (first column has most impact)
- Works best with WHERE, GROUP BY, and JOIN filters
Partitioning + Clustering (Recommended)
Combined approach:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,transaction_amount:FLOAT' \
  --time_partitioning_field timestamp \
  --clustering_fields customer_id \
  --description "Partitioned by day, clustered by customer" \
  project:dataset.transactions
SQL DDL version:
CREATE TABLE project.dataset.transactions (
timestamp TIMESTAMP,
customer_id STRING,
transaction_amount FLOAT64
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id;
Query benefits:
-- Partition pruning + clustering optimization
SELECT * FROM project.dataset.transactions
WHERE DATE(timestamp) = '2024-01-15' -- Partition filter
AND customer_id = 'CUST123' -- Cluster filter
Table Configuration Options
Expiration
Set table expiration:
bq mk -t \
  --expiration 2592000 \
  --schema 'field:TYPE' \
  project:dataset.temp_table
Expiration in seconds: 2592000 = 30 days
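The flag takes seconds, so it is easy to mis-size; a trivial helper (illustrative, not part of bq) makes the conversion explicit:

```python
def expiration_seconds(days):
    """Convert a retention period in days to the seconds value
    expected by bq's --expiration flag."""
    return days * 24 * 60 * 60

print(expiration_seconds(30))  # 2592000 -> 30 days
print(expiration_seconds(7))   # 604800  -> one week
```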
Update existing table:
bq update --expiration 604800 project:dataset.table
Remove expiration:
bq update --expiration 0 project:dataset.table
Labels
Add labels:
bq mk -t \
  --schema 'field:TYPE' \
  --label environment:production \
  --label team:analytics \
  project:dataset.table
Update labels:
bq update --set_label environment:staging project:dataset.table
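Label keys and values must follow GCP's format rules (lowercase, max 63 characters, keys start with a letter). A simplified ASCII-only checker — a hypothetical helper; real GCP labels also permit some international characters:

```python
import re

def valid_label(key, value):
    """Simplified check of GCP label rules: lowercase letters, digits,
    '_' and '-'; max 63 chars; keys must start with a lowercase letter."""
    key_ok = re.fullmatch(r"[a-z][a-z0-9_-]{0,62}", key) is not None
    val_ok = re.fullmatch(r"[a-z0-9_-]{0,63}", value) is not None
    return key_ok and val_ok

print(valid_label("environment", "production"))  # True
print(valid_label("Team", "analytics"))          # False: uppercase key
```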
Description
Set description:
bq update \
  --description "Customer transaction history with daily partitioning" \
  project:dataset.transactions
Schema Management
Adding Columns
New columns must be NULLABLE; you cannot add a REQUIRED column to an existing table.
Add a nullable column:
bq query --use_legacy_sql=false \
  'ALTER TABLE project.dataset.table
   ADD COLUMN new_field STRING'
Changing Column Modes
REQUIRED → NULLABLE (allowed):
ALTER TABLE project.dataset.table
ALTER COLUMN field_name DROP NOT NULL;
NULLABLE → REQUIRED (never allowed; column modes can only be relaxed)
Relaxing Column Types
Allowed type changes (widening coercions only):
- INT64 → FLOAT64 ✅
- INT64 → NUMERIC ✅
- INT64 → BIGNUMERIC ✅
- NUMERIC → FLOAT64 ✅
- INT64 → STRING ❌ (not supported by ALTER COLUMN; requires recreating the table)
Example:
ALTER TABLE project.dataset.table
ALTER COLUMN amount SET DATA TYPE FLOAT64;
External Tables
Create External Table (GCS)
CSV in GCS:
bq mk \
  --external_table_definition='customer_id:STRING,amount:FLOAT@CSV=gs://bucket/*.csv' \
  project:dataset.external_orders
The flag format is [SCHEMA@]FORMAT=URI; the schema is embedded in the definition rather than passed via --schema.
Parquet in GCS (schema auto-detected):
bq mk \
  --external_table_definition=PARQUET=gs://bucket/*.parquet \
  project:dataset.external_data
Supported formats: CSV, JSON, AVRO, PARQUET, ORC
External Table Limitations
- No DML operations (INSERT, UPDATE, DELETE)
- No guaranteed performance SLAs
- Data must be in GCS, Drive, or Bigtable
- Cannot be partitioned (but can use hive partitioning)
Snapshots and Clones
Table Snapshots
Create snapshot:
CREATE SNAPSHOT TABLE project.dataset.orders_snapshot
CLONE project.dataset.orders;
Restore from snapshot:
CREATE OR REPLACE TABLE project.dataset.orders
CLONE project.dataset.orders_snapshot;
Snapshot retention: snapshots do not expire by default; set an expiration with OPTIONS(expiration_timestamp=...) at creation (the dataset's default table expiration, if any, also applies)
Table Clones
Create table clone:
CREATE TABLE project.dataset.orders_clone
CLONE project.dataset.orders;
Difference from snapshot:
- Clone = new independent, writable table (billed only for data that diverges from the base table)
- Snapshot = read-only point-in-time copy
Time Travel
Query historical data:
-- Query table as it was 1 hour ago
SELECT * FROM project.dataset.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
-- Query table at specific time
SELECT * FROM project.dataset.orders
FOR SYSTEM_TIME AS OF '2024-01-15 10:00:00 UTC';
Time travel window: 7 days (168 hours) by default; configurable down to 2 days at the dataset level
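Building a valid FOR SYSTEM_TIME AS OF literal can be sketched as follows — an illustrative helper (not part of the BigQuery client) that rejects points outside the default 168-hour window:

```python
from datetime import datetime, timedelta, timezone

def system_time_as_of(hours_ago, window_hours=168):
    """Return a UTC timestamp string usable in FOR SYSTEM_TIME AS OF '...',
    refusing points outside the time travel window."""
    if not 0 <= hours_ago <= window_hours:
        raise ValueError("outside the time travel window")
    ts = datetime.now(timezone.utc) - timedelta(hours=hours_ago)
    return ts.strftime("%Y-%m-%d %H:%M:%S UTC")

print(system_time_as_of(1))  # e.g. '2024-01-15 10:00:00 UTC'
```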
Best Practices
Partition Selection
Use time-based partitioning when:
- Data has a timestamp/date column
- Queries filter by time ranges
- Data arrives chronologically
- You want automatic partition management
Use range partitioning when:
- Partitioning on an integer column (ID, age, etc.)
- Predictable value distribution
- Fixed range boundaries
Use ingestion-time partitioning when:
- No natural timestamp column
- Loading data from streaming sources
- You want simple partition management
Clustering Selection
Cluster on columns that are:
- Frequently used in WHERE clauses
- Used in JOIN conditions
- Used in GROUP BY
- High cardinality (many distinct values)
Order matters:
- Most filtered column first
- Then second most filtered
- Up to 4 columns total
Partitioning + Clustering Strategy
Optimal pattern:
CREATE TABLE project.dataset.optimized (
event_timestamp TIMESTAMP, -- Partition on this
customer_id STRING, -- Cluster on this (1st)
product_category STRING, -- Cluster on this (2nd)
amount FLOAT64
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id, product_category;
Query pattern:
-- Both partition and cluster benefit
SELECT SUM(amount)
FROM project.dataset.optimized
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND customer_id = 'CUST123'
GROUP BY product_category;
Checking Table Metadata
Get table information:
bq show --format=prettyjson project:dataset.table
Check partition info:
SELECT
partition_id,
total_rows,
total_logical_bytes,
last_modified_time
FROM project.dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'orders'
ORDER BY partition_id DESC
LIMIT 10;
Check clustering info:
SELECT
table_name,
clustering_ordinal_position,
column_name
FROM project.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE clustering_ordinal_position IS NOT NULL
ORDER BY table_name, clustering_ordinal_position;
Common Pitfalls
❌ Too many partitions
Problem: Creating 100,000+ partitions
Limit: 10,000 partitions per table
Solution: Use larger partition granularity (MONTH vs DAY)
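A quick sanity check against the 10,000-partition limit — a back-of-the-envelope sketch (helper name is illustrative):

```python
def estimated_partitions(retention_days, granularity):
    """Rough partition count for a retention window, to compare
    against the 10,000-partitions-per-table limit."""
    per_day = {"HOUR": 24, "DAY": 1, "MONTH": 1 / 30, "YEAR": 1 / 365}
    return int(retention_days * per_day[granularity])

print(estimated_partitions(5 * 365, "DAY"))   # 1825  -> well under the limit
print(estimated_partitions(5 * 365, "HOUR"))  # 43800 -> exceeds the limit
```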
❌ Wrong partition column
Problem: Partitioning on a column not used in queries
Solution: Partition on columns used in WHERE clauses
❌ Not filtering on partition
Problem: Query scans all partitions
Solution: Always include a partition filter in WHERE
❌ Clustering too many columns
Problem: Clustering on 5+ columns
Limit: 4 columns maximum
Solution: Choose the most selective columns
❌ Wrong cluster order
Problem: Least selective column first
Solution: Put the most selective column first
Table Maintenance
Update partition expiration
Set partition expiration:
bq update \
  --time_partitioning_expiration 2592000 \
  project:dataset.partitioned_table
Effect: partitions older than 30 days are deleted automatically
Optimize table storage
No optimization query is needed:
-- BigQuery automatically optimizes storage
-- No manual VACUUM or OPTIMIZE needed
BigQuery automatically:
- Compacts data
- Sorts by clustering columns
- Removes deleted rows
- Optimizes storage format
Access Control & Security
Row-Level Security
Row-level access policies filter data based on user/group membership. They coexist with column-level security.
Creating row-level policies:
CREATE ROW ACCESS POLICY policy_name
ON dataset.table
GRANT TO ("user:[email protected]")
FILTER USING (region = "US");
Multiple policies:
-- Policy for US users
CREATE ROW ACCESS POLICY us_users_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
-- Policy for managers (see all regions)
CREATE ROW ACCESS POLICY managers_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (TRUE);
Viewing policies:
SELECT * FROM dataset.INFORMATION_SCHEMA.ROW_ACCESS_POLICIES
WHERE table_name = 'orders';
Dropping policies:
DROP ROW ACCESS POLICY policy_name ON dataset.table;
Column-Level Security
Use policy tags from Data Catalog to restrict access to sensitive columns:
Creating table with policy tags:
CREATE TABLE dataset.customers (
  customer_id STRING,
  name STRING,
  email STRING,
  ssn STRING OPTIONS(
    policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG"]
  ),
  credit_score INT64 OPTIONS(
    policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/SENSITIVE_TAG"]
  )
);
Adding policy tags to existing columns:
ALTER TABLE dataset.customers
ALTER COLUMN ssn SET OPTIONS(
  policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG"]
);
How it works:
- Create a taxonomy and policy tags in Data Catalog
- Apply policy tags to table columns
- Grant IAM roles on policy tags (datacatalog.categoryFineGrainedReader)
- Users without permission cannot query those columns
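The policy tag resource name follows a fixed pattern; assembling it can be sketched with a small helper (name and example values are illustrative):

```python
def policy_tag_name(project, location, taxonomy, tag):
    """Assemble the Data Catalog policy tag resource name used in
    OPTIONS(policy_tags=[...])."""
    return (f"projects/{project}/locations/{location}"
            f"/taxonomies/{taxonomy}/policyTags/{tag}")

print(policy_tag_name("my-project", "us", "1234567890", "9876543210"))
```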
Authorized Views
Views that allow users to query data without direct table access:
Use cases:
- Sharing specific columns/rows without full table access
- Implementing business logic in access control
- Best performance for row/column filtering
Setup process:
-- 1. Create the view in a separate dataset (datasetB)
CREATE VIEW datasetB.public_orders AS
SELECT order_id, customer_id, amount, order_date
FROM datasetA.orders
WHERE status = 'completed';
-- 2. Authorize datasetB.public_orders on datasetA so the view can read
-- the table. This is done via dataset sharing in the Cloud Console, or
-- by editing the dataset's access list:
bq show --format=prettyjson project:datasetA > dataset.json
# add a view entry for datasetB.public_orders to the "access" array, then:
bq update --source dataset.json project:datasetA
Example authorized view:
-- View in the shared_views dataset
CREATE VIEW shared_views.customer_summary AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_spent
FROM private_data.orders
GROUP BY customer_id;
-- Grant access to the view (not the underlying table):
-- users can query shared_views.customer_summary
-- but cannot access private_data.orders
Benefits:
- Row/column filtering without policy overhead
- Business logic in SQL (e.g., only show completed orders)
- Best query performance
- Centralized access control
Security Best Practices
- Layered security:
  - Use row-level policies for user-based filtering
  - Use column-level security for sensitive data (PII, PHI)
  - Use authorized views for complex access patterns
- Performance:
  - Authorized views: best performance (compiled into the query)
  - Row-level policies: slight overhead (filter applied at query time)
  - Column-level security: no performance impact
- Combining approaches:
-- Table with column-level security AND a row-level policy
CREATE TABLE dataset.sensitive_data (
  user_id STRING,
  region STRING,
  created_at TIMESTAMP,
  ssn STRING OPTIONS(policy_tags=["..."]),
  data JSON
)
PARTITION BY DATE(created_at);
-- Row-level policy
CREATE ROW ACCESS POLICY regional_access
ON dataset.sensitive_data
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
- Auditing: Monitor access with Cloud Audit Logs exported to BigQuery:
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail,
  protopayload_auditlog.resourceName,
  protopayload_auditlog.methodName
FROM `PROJECT.DATASET.cloudaudit_googleapis_com_data_access_*`
WHERE resource.type = "bigquery_dataset"
ORDER BY timestamp DESC;
Quick Reference
Partition types:
- Time-based: HOUR, DAY, MONTH, YEAR
- Ingestion-time: Automatic _PARTITIONTIME
- Range: Integer column ranges
Clustering:
- Max 4 columns
- Order matters
- Works with or without partitioning
Security:
- Row-level: Filter by user/group
- Column-level: Policy tags for sensitive data
- Authorized views: Business logic filtering
Limits:
- 10,000 partitions per table
- 4 clustering columns
- 7-day time travel window
- 10,000 columns per table
- 100 row-level policies per table