BigQuery Table Management
Use this skill when creating, modifying, or optimizing BigQuery table structures with partitioning and clustering.
Creating Tables
Basic Table Creation
Using bq mk:
bq mk -t \
  --schema 'customer_id:STRING,amount:FLOAT,date:DATE' \
  --description "Customer orders table" \
  project:dataset.orders
Using SQL DDL:
CREATE TABLE project.dataset.orders (
customer_id STRING,
amount FLOAT64,
date DATE,
created_at TIMESTAMP
);
Partitioning Strategies
Time-Based Partitioning
Create time-partitioned table:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,amount:FLOAT' \
  --time_partitioning_field timestamp \
  --time_partitioning_type DAY \
  project:dataset.orders
SQL DDL version:
CREATE TABLE project.dataset.orders (
timestamp TIMESTAMP,
customer_id STRING,
amount FLOAT64
)
PARTITION BY DATE(timestamp);
Partitioning options:
- DAY - Daily partitions (most common)
- HOUR - Hourly partitions (for high-volume data)
- MONTH - Monthly partitions (for historical data)
- YEAR - Yearly partitions (for very old data)
Ingestion-Time Partitioning
Create table with automatic _PARTITIONTIME:
bq mk -t \
  --schema 'customer_id:STRING,amount:FLOAT' \
  --time_partitioning_type DAY \
  project:dataset.orders
Query with ingestion-time partition:
SELECT * FROM project.dataset.orders
WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01');
Range Partitioning
Create range-partitioned table:
bq mk -t \
  --schema 'customer_id:INTEGER,region:STRING,sales:FLOAT' \
  --range_partitioning=customer_id,0,100,10 \
  project:dataset.sales
Parameters: field,start,end,interval
- Creates partitions: [0,10), [10,20), [20,30), ..., [90,100)
- Values outside [0,100) land in the __UNPARTITIONED__ partition
SQL DDL version:
CREATE TABLE project.dataset.sales (
customer_id INT64,
region STRING,
sales FLOAT64
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));
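The bucket layout above can be sketched in plain Python. This is an illustrative helper (not part of bq or BigQuery) mirroring how RANGE_BUCKET with GENERATE_ARRAY(0, 100, 10) assigns rows to partitions:

```python
def range_partition_lower_bound(value, start=0, end=100, interval=10):
    """Return the lower bound of the range partition a value falls into,
    or None for out-of-range values, which BigQuery routes to the
    __UNPARTITIONED__ partition."""
    if value < start or value >= end:
        return None
    return start + ((value - start) // interval) * interval

print(range_partition_lower_bound(37))   # 30 -> the [30, 40) partition
print(range_partition_lower_bound(150))  # None -> __UNPARTITIONED__
```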
Clustering
Basic Clustering
Create clustered table:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,product_id:STRING,amount:FLOAT' \
  --clustering_fields customer_id,product_id \
  project:dataset.orders
SQL DDL version:
CREATE TABLE project.dataset.orders (
timestamp TIMESTAMP,
customer_id STRING,
product_id STRING,
amount FLOAT64
)
CLUSTER BY customer_id, product_id;
Clustering rules:
- Up to 4 clustering columns
- Order matters (first column has most impact)
- Works best with WHERE, GROUP BY, and JOIN filters
Partitioning + Clustering (Recommended)
Combined approach:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,transaction_amount:FLOAT' \
  --time_partitioning_field timestamp \
  --clustering_fields customer_id \
  --description "Partitioned by day, clustered by customer" \
  project:dataset.transactions
SQL DDL version:
CREATE TABLE project.dataset.transactions (
timestamp TIMESTAMP,
customer_id STRING,
transaction_amount FLOAT64
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id;
Query benefits:
-- Partition pruning + clustering optimization
SELECT * FROM project.dataset.transactions
WHERE DATE(timestamp) = '2024-01-15' -- Partition filter
AND customer_id = 'CUST123' -- Cluster filter
Table Configuration Options
Expiration
Set table expiration:
bq mk -t \
  --expiration 2592000 \
  --schema 'field:TYPE' \
  project:dataset.temp_table
Expiration in seconds: 2592000 = 30 days
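The flag takes seconds, so it is easy to mis-size; a trivial helper (illustrative, not part of bq) makes the conversion explicit:

```python
def expiration_seconds(days):
    """Convert a retention period in days to the seconds value
    expected by bq's --expiration flag."""
    return days * 24 * 60 * 60

print(expiration_seconds(30))  # 2592000 -> 30 days
print(expiration_seconds(7))   # 604800  -> one week
```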
Update existing table:
bq update --expiration 604800 project:dataset.table
Remove expiration:
bq update --expiration 0 project:dataset.table
Labels
Add labels:
bq mk -t \
  --schema 'field:TYPE' \
  --label environment:production \
  --label team:analytics \
  project:dataset.table
Update labels:
bq update --set_label environment:staging project:dataset.table
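Label keys and values must follow GCP's format rules (lowercase, max 63 characters, keys start with a letter). A simplified ASCII-only checker — a hypothetical helper; real GCP labels also permit some international characters:

```python
import re

def valid_label(key, value):
    """Simplified check of GCP label rules: lowercase letters, digits,
    '_' and '-'; max 63 chars; keys must start with a lowercase letter."""
    key_ok = re.fullmatch(r"[a-z][a-z0-9_-]{0,62}", key) is not None
    val_ok = re.fullmatch(r"[a-z0-9_-]{0,63}", value) is not None
    return key_ok and val_ok

print(valid_label("environment", "production"))  # True
print(valid_label("Team", "analytics"))          # False: uppercase key
```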
Description
Set description:
bq update \
  --description "Customer transaction history with daily partitioning" \
  project:dataset.transactions
Schema Management
Adding Columns
New columns must be NULLABLE; you cannot add a REQUIRED column to an existing table.
Add a nullable column:
bq query --use_legacy_sql=false \
  'ALTER TABLE project.dataset.table
   ADD COLUMN new_field STRING'
Changing Column Modes
REQUIRED → NULLABLE (allowed):
ALTER TABLE project.dataset.table
ALTER COLUMN field_name DROP NOT NULL;
NULLABLE → REQUIRED (never allowed; column modes can only be relaxed)
Relaxing Column Types
Allowed type changes (widening coercions only):
- INT64 → FLOAT64 ✅
- INT64 → NUMERIC ✅
- INT64 → BIGNUMERIC ✅
- NUMERIC → FLOAT64 ✅
- INT64 → STRING ❌ (not supported by ALTER COLUMN; requires recreating the table)
Example:
ALTER TABLE project.dataset.table
ALTER COLUMN amount SET DATA TYPE FLOAT64;
External Tables
Create External Table (GCS)
CSV in GCS:
bq mk \
  --external_table_definition='customer_id:STRING,amount:FLOAT@CSV=gs://bucket/*.csv' \
  project:dataset.external_orders
The flag format is [SCHEMA@]FORMAT=URI; the schema is embedded in the definition rather than passed via --schema.
Parquet in GCS (schema auto-detected):
bq mk \
  --external_table_definition=PARQUET=gs://bucket/*.parquet \
  project:dataset.external_data
Supported formats: CSV, JSON, AVRO, PARQUET, ORC
External Table Limitations
- No DML operations (INSERT, UPDATE, DELETE)
- No guaranteed performance SLAs
- Data must be in GCS, Drive, or Bigtable
- Cannot be partitioned (but can use hive partitioning)
Snapshots and Clones
Table Snapshots
Create snapshot:
CREATE SNAPSHOT TABLE project.dataset.orders_snapshot
CLONE project.dataset.orders;
Restore from snapshot:
CREATE OR REPLACE TABLE project.dataset.orders
CLONE project.dataset.orders_snapshot;
Snapshot retention: snapshots do not expire by default; set an expiration with OPTIONS(expiration_timestamp=...) at creation (the dataset's default table expiration, if any, also applies)
Table Clones
Create table clone:
CREATE TABLE project.dataset.orders_clone
CLONE project.dataset.orders;
Difference from snapshot:
- Clone = new independent, writable table (billed only for data that diverges from the base table)
- Snapshot = read-only point-in-time copy
Time Travel
Query historical data:
-- Query table as it was 1 hour ago
SELECT * FROM project.dataset.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
-- Query table at specific time
SELECT * FROM project.dataset.orders
FOR SYSTEM_TIME AS OF '2024-01-15 10:00:00 UTC';
Time travel window: 7 days (168 hours) by default; configurable down to 2 days at the dataset level
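Building a valid FOR SYSTEM_TIME AS OF literal can be sketched as follows — an illustrative helper (not part of the BigQuery client) that rejects points outside the default 168-hour window:

```python
from datetime import datetime, timedelta, timezone

def system_time_as_of(hours_ago, window_hours=168):
    """Return a UTC timestamp string usable in FOR SYSTEM_TIME AS OF '...',
    refusing points outside the time travel window."""
    if not 0 <= hours_ago <= window_hours:
        raise ValueError("outside the time travel window")
    ts = datetime.now(timezone.utc) - timedelta(hours=hours_ago)
    return ts.strftime("%Y-%m-%d %H:%M:%S UTC")

print(system_time_as_of(1))  # e.g. '2024-01-15 10:00:00 UTC'
```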
Best Practices
Partition Selection
Use time-based partitioning when:
- Data has a timestamp/date column
- Queries filter by time ranges
- Data arrives chronologically
- You want automatic partition management
Use range partitioning when:
- Partitioning on an integer column (ID, age, etc.)
- Predictable value distribution
- Fixed range boundaries
Use ingestion-time partitioning when:
- No natural timestamp column
- Loading data from streaming sources
- You want simple partition management
Clustering Selection
Cluster on columns that are:
- Frequently used in WHERE clauses
- Used in JOIN conditions
- Used in GROUP BY
- High cardinality (many distinct values)
Order matters:
- Most filtered column first
- Then second most filtered
- Up to 4 columns total
Partitioning + Clustering Strategy
Optimal pattern:
CREATE TABLE project.dataset.optimized (
event_timestamp TIMESTAMP, -- Partition on this
customer_id STRING, -- Cluster on this (1st)
product_category STRING, -- Cluster on this (2nd)
amount FLOAT64
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id, product_category;
Query pattern:
-- Both partition and cluster benefit
SELECT SUM(amount)
FROM project.dataset.optimized
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND customer_id = 'CUST123'
GROUP BY product_category;
Checking Table Metadata
Get table information:
bq show --format=prettyjson project:dataset.table
Check partition info:
SELECT
partition_id,
total_rows,
total_logical_bytes,
last_modified_time
FROM project.dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'orders'
ORDER BY partition_id DESC
LIMIT 10;
Check clustering info:
SELECT
table_name,
clustering_ordinal_position,
column_name
FROM project.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE clustering_ordinal_position IS NOT NULL
ORDER BY table_name, clustering_ordinal_position;
Common Pitfalls
❌ Too many partitions
Problem: Creating 100,000+ partitions
Limit: 10,000 partitions per table
Solution: Use larger partition granularity (MONTH vs DAY)
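A quick sanity check against the 10,000-partition limit — a back-of-the-envelope sketch (helper name is illustrative):

```python
def estimated_partitions(retention_days, granularity):
    """Rough partition count for a retention window, to compare
    against the 10,000-partitions-per-table limit."""
    per_day = {"HOUR": 24, "DAY": 1, "MONTH": 1 / 30, "YEAR": 1 / 365}
    return int(retention_days * per_day[granularity])

print(estimated_partitions(5 * 365, "DAY"))   # 1825  -> well under the limit
print(estimated_partitions(5 * 365, "HOUR"))  # 43800 -> exceeds the limit
```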
❌ Wrong partition column
Problem: Partitioning on a column not used in queries
Solution: Partition on columns used in WHERE clauses
❌ Not filtering on partition
Problem: Query scans all partitions
Solution: Always include a partition filter in WHERE
❌ Clustering too many columns
Problem: Clustering on 5+ columns
Limit: 4 columns maximum
Solution: Choose the most selective columns
❌ Wrong cluster order
Problem: Least selective column first
Solution: Put the most selective column first
Table Maintenance
Update partition expiration
Set partition expiration:
bq update \
  --time_partitioning_expiration 2592000 \
  project:dataset.partitioned_table
Effect: partitions older than 30 days are deleted automatically
Optimize table storage
No optimization query is needed:
-- BigQuery automatically optimizes storage
-- No manual VACUUM or OPTIMIZE needed
BigQuery automatically:
- Compacts data
- Sorts by clustering columns
- Removes deleted rows
- Optimizes storage format
Access Control & Security
Row-Level Security
Row-level access policies filter data based on user/group membership. They coexist with column-level security.
Creating row-level policies:
CREATE ROW ACCESS POLICY policy_name
ON dataset.table
GRANT TO ("user:[email protected]")
FILTER USING (region = "US");
Multiple policies:
-- Policy for US users
CREATE ROW ACCESS POLICY us_users_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
-- Policy for managers (see all regions)
CREATE ROW ACCESS POLICY managers_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (TRUE);
Viewing policies:
SELECT * FROM dataset.INFORMATION_SCHEMA.ROW_ACCESS_POLICIES
WHERE table_name = 'orders';
Dropping policies:
DROP ROW ACCESS POLICY policy_name ON dataset.table;
Column-Level Security
Use policy tags from Data Catalog to restrict access to sensitive columns:
Creating table with policy tags:
CREATE TABLE dataset.customers (
  customer_id STRING,
  name STRING,
  email STRING,
  ssn STRING OPTIONS(
    policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG"]
  ),
  credit_score INT64 OPTIONS(
    policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/SENSITIVE_TAG"]
  )
);
Adding policy tags to existing columns:
ALTER TABLE dataset.customers
ALTER COLUMN ssn SET OPTIONS(
  policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG"]
);
How it works:
- Create a taxonomy and policy tags in Data Catalog
- Apply policy tags to table columns
- Grant IAM roles on policy tags (datacatalog.categoryFineGrainedReader)
- Users without permission cannot query those columns
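The policy tag resource name follows a fixed pattern; assembling it can be sketched with a small helper (name and example values are illustrative):

```python
def policy_tag_name(project, location, taxonomy, tag):
    """Assemble the Data Catalog policy tag resource name used in
    OPTIONS(policy_tags=[...])."""
    return (f"projects/{project}/locations/{location}"
            f"/taxonomies/{taxonomy}/policyTags/{tag}")

print(policy_tag_name("my-project", "us", "1234567890", "9876543210"))
```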
Authorized Views
Views that allow users to query data without direct table access:
Use cases:
- Sharing specific columns/rows without full table access
- Implementing business logic in access control
- Best performance for row/column filtering
Setup process:
-- 1. Create the view in a separate dataset (datasetB)
CREATE VIEW datasetB.public_orders AS
SELECT order_id, customer_id, amount, order_date
FROM datasetA.orders
WHERE status = 'completed';
-- 2. Authorize datasetB.public_orders on datasetA so the view can read
-- the table. This is done via dataset sharing in the Cloud Console, or
-- by editing the dataset's access list:
bq show --format=prettyjson project:datasetA > dataset.json
# add a view entry for datasetB.public_orders to the "access" array, then:
bq update --source dataset.json project:datasetA
Example authorized view:
-- View in the shared_views dataset
CREATE VIEW shared_views.customer_summary AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_spent
FROM private_data.orders
GROUP BY customer_id;
-- Grant access to the view (not the underlying table):
-- users can query shared_views.customer_summary
-- but cannot access private_data.orders
Benefits:
- Row/column filtering without policy overhead
- Business logic in SQL (e.g., only show completed orders)
- Best query performance
- Centralized access control
Security Best Practices
- Layered security:
  - Use row-level policies for user-based filtering
  - Use column-level security for sensitive data (PII, PHI)
  - Use authorized views for complex access patterns
- Performance:
  - Authorized views: best performance (compiled into the query)
  - Row-level policies: slight overhead (filter applied at query time)
  - Column-level security: no performance impact
- Combining approaches:
-- Table with column-level security AND a row-level policy
CREATE TABLE dataset.sensitive_data (
  user_id STRING,
  region STRING,
  created_at TIMESTAMP,
  ssn STRING OPTIONS(policy_tags=["..."]),
  data JSON
)
PARTITION BY DATE(created_at);
-- Row-level policy
CREATE ROW ACCESS POLICY regional_access
ON dataset.sensitive_data
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
- Auditing: Monitor access with Cloud Audit Logs exported to BigQuery:
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail,
  protopayload_auditlog.resourceName,
  protopayload_auditlog.methodName
FROM `PROJECT.DATASET.cloudaudit_googleapis_com_data_access_*`
WHERE resource.type = "bigquery_dataset"
ORDER BY timestamp DESC;
Quick Reference
Partition types:
- Time-based: HOUR, DAY, MONTH, YEAR
- Ingestion-time: Automatic _PARTITIONTIME
- Range: Integer column ranges
Clustering:
- Max 4 columns
- Order matters
- Works with or without partitioning
Security:
- Row-level: Filter by user/group
- Column-level: Policy tags for sensitive data
- Authorized views: Business logic filtering
Limits:
- 10,000 partitions per table
- 4 clustering columns
- 7-day time travel window
- 10,000 columns per table
- 100 row-level policies per table