If you're loading data into Snowflake the same way you loaded it into your legacy data warehouse, you're probably spending more money than necessary and dealing with longer refresh cycles than you should. Incremental loading isn't just a nice-to-have optimization—it's a fundamental strategy that separates efficient data pipelines from expensive, slow ones.
In this post, I'll walk you through the most effective incremental loading strategies for Snowflake, explain when to use each approach, and share practical patterns that we've battle-tested across dozens of production environments.
Why Incremental Loading Matters in Snowflake
Let's start with the business case. Full table reloads might seem simpler to implement, but they create three significant problems:
- Cost inefficiency: You're paying to process and store the same data repeatedly. With Snowflake's compute costs, this adds up quickly.
- Performance degradation: As tables grow, full reloads take progressively longer, eventually breaking SLA windows.
- Resource contention: Large batch loads create warehouse spikes that impact concurrent users and queries.
Incremental loading solves these problems by processing only what's changed since the last load. For a table with millions of rows where only thousands change daily, you're looking at potential cost savings of 90% or more.
Strategy 1: Timestamp-Based Incremental Loading
This is the workhorse pattern that most teams should master first. The concept is simple: track when records were last modified and only load records newer than your last successful load.
The Basic Implementation
Here's how this looks in practice:
-- Create a control table to track load watermarks
CREATE TABLE etl_watermarks (
table_name VARCHAR(255) PRIMARY KEY,
last_loaded_timestamp TIMESTAMP_NTZ,
updated_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Incremental load query
INSERT INTO target_table
SELECT *
FROM source_table
WHERE modified_date > (
SELECT last_loaded_timestamp
FROM etl_watermarks
WHERE table_name = 'target_table'
)
AND modified_date <= CURRENT_TIMESTAMP();
-- Update watermark after successful load
UPDATE etl_watermarks
SET last_loaded_timestamp = CURRENT_TIMESTAMP(),
updated_at = CURRENT_TIMESTAMP()
WHERE table_name = 'target_table';The key insight here is the upper bound on the timestamp filter. By capping at the start of your load process, you avoid missing records that arrive during processing—a common gotcha that causes data loss.
When to Use Timestamp-Based Loading
This strategy works best when:
- Your source system maintains reliable
updated_atormodified_datecolumns - You're dealing with insert-heavy workloads or append-only data
- You need a straightforward approach that's easy to monitor and debug
The main limitation? It doesn't handle hard deletes well. If a record is deleted from the source, timestamp-based loading won't detect it.
Strategy 2: Change Data Capture (CDC) with Streams
Snowflake Streams are purpose-built for incremental processing, and they're genuinely elegant when used correctly. A Stream creates a change table that tracks inserts, updates, and deletes on a source table.
Implementing Stream-Based Loading
-- Create a stream on your staging table
CREATE STREAM customer_stream ON TABLE customer_stage;
-- Process changes with full DML awareness
MERGE INTO customer_target t
USING customer_stream s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE'
THEN DELETE
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT'
THEN UPDATE SET
t.name = s.name,
t.email = s.email,
t.updated_at = s.updated_at
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);The beauty of Streams is that they automatically advance their offset after successful consumption. No manual watermark management needed. The METADATA$ACTION and METADATA$ISUPDATE columns tell you exactly what changed.
Stream Best Practices
After working with Streams across multiple production systems, here are the patterns that matter:
- Process streams regularly: Stream storage has costs. Process and consume changes at least daily to avoid accumulating large change sets.
- Use transactional consumption: Wrap your stream consumption in explicit transactions to ensure atomicity.
- Monitor stream staleness: Set up alerts if streams haven't been consumed within your expected interval.
Streams shine for internal Snowflake table-to-table pipelines but can't directly track changes in external systems. For that, you need a different approach.
Strategy 3: Hash Diff Pattern for SCD Type 2
When you need to track historical changes and maintain Slowly Changing Dimension (SCD) Type 2 logic, the hash diff pattern is your friend. This approach detects changes by comparing hash values of concatenated columns.
-- Add hash column to your target table
ALTER TABLE customer_target
ADD COLUMN row_hash VARCHAR(64);
-- Generate hash in staging
CREATE OR REPLACE VIEW customer_stage_hashed AS
SELECT
*,
SHA2_HEX(CONCAT_WS('||',
customer_id,
name,
email,
address
)) as row_hash
FROM customer_stage;
-- Identify changes
WITH changes AS (
SELECT s.*
FROM customer_stage_hashed s
LEFT JOIN customer_target t
ON s.customer_id = t.customer_id
AND t.is_current = TRUE
WHERE t.row_hash IS NULL
OR t.row_hash != s.row_hash
)
-- Expire old records and insert new versions
-- (Full SCD Type 2 logic here)This pattern is computationally more expensive than pure timestamp-based loading because you're comparing hash values across potentially large tables. Use it when you specifically need historical tracking, not as a default incremental strategy.
Strategy 4: Merge with Conditional Logic
The MERGE statement is Snowflake's Swiss Army knife for incremental loading. It handles inserts, updates, and deletes in a single statement, which is both elegant and performant.
Here's a production-ready pattern with conditional update logic:
MERGE INTO orders_target t
USING orders_staging s
ON t.order_id = s.order_id
WHEN MATCHED
AND (t.status != s.status
OR t.total_amount != s.total_amount
OR t.updated_at < s.updated_at)
THEN UPDATE SET
t.status = s.status,
t.total_amount = s.total_amount,
t.updated_at = s.updated_at,
t.etl_updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED
THEN INSERT (order_id, status, total_amount, updated_at, etl_inserted_at)
VALUES (s.order_id, s.status, s.total_amount, s.updated_at, CURRENT_TIMESTAMP());The conditional update clause is crucial—it prevents unnecessary writes when data hasn't actually changed, which reduces costs and avoids triggering downstream streams unnecessarily.
Choosing the Right Strategy
After implementing hundreds of data pipelines, here's my opinionated decision framework:
- Start with timestamp-based loading for 80% of use cases. It's simple, debuggable, and cost-effective.
- Use Streams for table-to-table pipelines within Snowflake, especially when you need to track deletes.
- Implement hash diff only when you specifically need SCD Type 2 historical tracking.
- Reserve full reloads for small dimension tables (under 100K rows) where the simplicity is worth the cost.
Performance and Cost Optimization Tips
Regardless of which strategy you choose, these optimizations apply universally:
Cluster your tables appropriately: If you're filtering on timestamp columns, cluster on those columns. The metadata pruning will dramatically reduce scan costs.
Size your warehouses correctly: Incremental loads often work fine on smaller warehouses. Don't default to large warehouses out of habit.
Batch appropriately: Loading every 5 minutes isn't always better than hourly. Balance freshness requirements against compute costs.
Monitor query profiles: Use Snowflake's query profile to identify partition pruning efficiency. If you're scanning the full table despite filters, you have a clustering or design issue.
Wrapping Up
Incremental loading isn't a single technique—it's a toolkit that you match to your specific requirements. The timestamp-based approach will serve you well for most scenarios, while Streams and hash diff patterns solve specific challenges in CDC and historical tracking.
The real skill is recognizing which tool to use when, and being willing to refactor as your requirements evolve. A pipeline that starts with full reloads might graduate to timestamp-based incremental loading, and eventually to Stream-based processing as data volumes grow.
The best data engineers I've worked with treat incremental loading as a default mindset, not an optimization you add later. Build it in from the start, and your future self will thank you when those tables hit millions of rows.