dbt Incremental Model Development

Choose the right strategy. Design the unique_key carefully. Handle edge cases.

When to Use Incremental

| Scenario | Recommendation |
| --- | --- |
| Source data < 10M rows | Use table (simpler, full refresh is fast) |
| Source data > 10M rows | Consider incremental |
| Source data updated in place | Use incremental with merge strategy |
| Append-only source (logs, events) | Use incremental with append strategy |
| Partitioned warehouse data | Use insert_overwrite if supported |

Default to table unless you have a clear performance reason for incremental.

Critical Rules

  • ALWAYS test with --full-refresh first before relying on incremental logic

  • ALWAYS verify unique_key is truly unique in both source and target

  • If merge fails 3+ times, check unique_key for duplicates

  • Run full refresh periodically to prevent data drift

Workflow

  1. Confirm Incremental is Needed

```bash
# Check the source table size
dbt show --inline "select count(*) from {{ source('schema', 'table') }}"
```

If count < 10 million, consider using table instead. Incremental adds complexity.

  2. Understand the Source Data Pattern

Before choosing a strategy, answer:

  • Is data append-only? (new rows added, never updated)

  • Are existing rows updated? (need merge/upsert)

  • Is there a reliable timestamp? (for filtering new data)

  • What's the unique identifier? (for merge matching)

```bash
# Check for a timestamp column
dbt show --inline "
    select
        min(updated_at) as earliest,
        max(updated_at) as latest,
        count(distinct date(updated_at)) as days_of_data
    from {{ source('schema', 'table') }}
"
```

  3. Choose the Right Strategy

| Strategy | Use When | How It Works |
| --- | --- | --- |
| append | Data is append-only, no updates | INSERT only, no deduplication |
| merge | Data can be updated | MERGE/UPSERT by unique_key |
| delete+insert | Data updated in batches | DELETE matching rows, then INSERT |
| insert_overwrite | Partitioned tables (BigQuery, Spark) | Replace entire partitions |

Default: merge is safest for most use cases.

Note: Strategy availability varies by adapter. Check the dbt incremental strategy docs for your specific warehouse.
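
The strategy can even vary by target within one project. A minimal sketch of per-adapter selection, assuming BigQuery/Spark targets should overwrite partitions while everything else merges (the adapter names and key are illustrative, and a real insert_overwrite model also needs a partition_by config):

```sql
-- choose the incremental strategy from the active target's adapter type
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite' if target.type in ('bigquery', 'spark') else 'merge',
        unique_key='id'
    )
}}
```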

  4. Design the Unique Key

CRITICAL: unique_key must be truly unique in your data.

```bash
# Verify uniqueness BEFORE creating the model
dbt show --inline "
    select {{ unique_key_column }}, count(*)
    from {{ source('schema', 'table') }}
    group by 1
    having count(*) > 1
    limit 10
"
```

If duplicates exist:

  • Add more columns to form a composite key (see the sketch after this list)

  • Add deduplication logic in the model

  • Use delete+insert instead of merge
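
A composite key is passed as a list. A minimal sketch, assuming an order-lines source where only the pair of columns is unique (the column names are hypothetical; list-valued unique_key requires a reasonably recent dbt version):

```sql
-- neither column is unique alone; together they identify one row
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=['order_id', 'line_number']
    )
}}
```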

  5. Write the Incremental Model

```sql
-- strategy: merge (alternatives: append, delete+insert)
-- unique_key MUST be truly unique; on_schema_change handles new source columns
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id',
        on_schema_change='append_new_columns'
    )
}}

select
    id,
    column_a,
    column_b,
    updated_at
from {{ source('schema', 'table') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

  6. Build with Full Refresh First

ALWAYS verify with full refresh before trusting incremental logic.

```bash
# First run: full refresh to establish a baseline
dbt build --select <model_name> --full-refresh

# Verify the output
dbt show --select <model_name> --limit 10
dbt show --inline "select count(*) from {{ ref('model_name') }}"
```

  7. Test Incremental Logic

```bash
# Run incrementally (no --full-refresh)
dbt build --select <model_name>

# Verify the row count changed appropriately
dbt show --inline "select count(*) from {{ ref('model_name') }}"
```
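
A quick sanity check is to compare the high-water mark in the target against the source. A sketch, reusing the placeholder names from above:

```bash
dbt show --inline "
    select
        (select max(updated_at) from {{ ref('model_name') }}) as target_max,
        (select max(updated_at) from {{ source('schema', 'table') }}) as source_max
"
```

If target_max lags source_max right after a run, the incremental filter is likely dropping rows.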

  8. Handle Schema Changes

Set on_schema_change based on your needs:

| Setting | Behavior |
| --- | --- |
| ignore (default) | New columns in source are ignored |
| append_new_columns | New columns are added to the target |
| sync_all_columns | Target schema matches source exactly |
| fail | Error if the schema changes |

Common Incremental Problems

Problem: Merge Fails with Duplicate Key

Symptom: "Cannot MERGE with duplicate values"

Cause: Multiple rows with same unique_key in source or target.

Fix:

```sql
-- Add deduplication using a CTE (cross-database compatible)
with deduplicated as (

    select
        *,
        row_number() over (partition by id order by updated_at desc) as rn
    from {{ source('schema', 'table') }}
    {% if is_incremental() %}
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}

)

-- list columns explicitly if the helper rn column should not land in the target
select * from deduplicated where rn = 1
```

Problem: No Partition Pruning (Full Table Scan)

Symptom: Incremental runs take as long as full refresh.

Cause: Dynamic date filter prevents partition pruning.

Fix:

```sql
{% if is_incremental() %}
-- a static lower bound lets the planner prune partitions;
-- the subquery alone is resolved too late for pruning on some warehouses
where updated_at >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
  and updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Problem: Late-Arriving Data is Missed

Symptom: Some records never appear in incremental model.

Cause: Filtering by max(updated_at) misses late arrivals.

Fix: Use a lookback window with a fixed offset from current date:

```sql
{% if is_incremental() %}
-- look back 3 days to catch late-arriving data
where updated_at >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
{% endif %}
```

Alternatively, use a variable for the lookback period:

```sql
{% set lookback_days = 3 %}

{% if is_incremental() %}
where updated_at >= {{ dbt.dateadd('day', -lookback_days, dbt.current_timestamp()) }}
{% endif %}
```

Because a lookback reprocesses recent rows, pair it with the merge strategy (or explicit deduplication); with append it will insert duplicates.

Problem: Schema Drift Causes Errors

Symptom: "Column X not found" after source adds column.

Fix: Set on_schema_change='append_new_columns' in config.
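
For instance, a minimal config sketch:

```sql
{{
    config(
        materialized='incremental',
        unique_key='id',
        on_schema_change='append_new_columns'
    )
}}
```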

Problem: Data Drift Over Time

Symptom: Counts diverge between incremental and full refresh.

Fix: Schedule periodic full refresh:

```bash
# Weekly full refresh
dbt build --select <model_name> --full-refresh
```
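
One way to operationalize this is to tag drift-prone models and point a weekly job at the tag. A sketch (the tag name is illustrative):

```bash
# weekly job: rebuild tagged incremental models from scratch
dbt build --select tag:weekly_full_refresh --full-refresh
```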

Incremental Strategy Reference

Append (Simplest)

```sql
{{ config(materialized='incremental', incremental_strategy='append') }}

select * from {{ source('events', 'raw') }}
{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

  • No unique_key needed

  • Fastest performance

  • Only use for append-only data (logs, events, immutable records)

Merge (Default)

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id'
    )
}}

select * from {{ source('crm', 'contacts') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

  • Requires unique_key

  • Handles updates and inserts

  • Most common strategy

Delete+Insert (Batch Updates)

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='id'
    )
}}

select * from {{ source('orders', 'raw') }}
{% if is_incremental() %}
where order_date >= {{ dbt.dateadd('day', -7, dbt.current_timestamp()) }}
{% endif %}
```

  • Deletes all matching rows first

  • Good for reprocessing batches

  • Use when merge has duplicate key issues

Insert Overwrite (Partitioned)

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
    )
}}

select * from {{ source('events', 'raw') }}
{% if is_incremental() %}
where event_date >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
{% endif %}
```

  • Replaces entire partitions

  • Best for partitioned tables in BigQuery/Spark

  • No unique_key needed (operates on partitions)
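
On BigQuery, a static-partitions variant replaces a fixed set of partitions and keeps the filter statically prunable. A sketch following the pattern in dbt's BigQuery docs (source and field names are placeholders):

```sql
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)'
] %}

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'},
        partitions=partitions_to_replace
    )
}}

select * from {{ source('events', 'raw') }}
{% if is_incremental() %}
-- only read the partitions that are being replaced
where event_date in ({{ partitions_to_replace | join(', ') }})
{% endif %}
```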

Anti-Patterns

  • Using incremental for small tables (< 10M rows)

  • Not testing with full-refresh first

  • Using append strategy when data can be updated

  • Not verifying unique_key uniqueness

  • Relying on exact timestamp match without lookback

  • Never running full refresh (causes data drift)

  • Using merge with non-unique keys

Testing Checklist

  • Model runs with --full-refresh

  • Model runs incrementally (without flag)

  • unique_key verified as truly unique

  • Row counts reasonable after incremental run

  • Late-arriving data handled (lookback window)

  • Schema changes handled (on_schema_change set)

  • Periodic full refresh scheduled
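
Several of these checks can be enforced continuously with dbt's built-in tests. A minimal schema.yml sketch (model and column names are placeholders):

```yaml
version: 2

models:
  - name: model_name
    columns:
      - name: id
        tests:
          - unique
          - not_null
```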
