pg_trickle Blog
Note: This blog directory is an experiment. All posts were generated with AI assistance (GitHub Copilot / Claude) as a way to explore how well LLM-generated technical writing holds up for a niche systems engineering topic. The technical content has been reviewed for accuracy, but treat the posts as drafts — not as officially reviewed documentation. The blog is meant for informational purposes — to learn about interesting topics in the context of pg_trickle. It is a showcase for use-cases rather than a definitive reference; for authoritative documentation see the pg_trickle documentation.
Posts
Core Concepts & Theory
| Post | Summary |
|---|---|
| Why Your Materialized Views Are Always Stale | Explains why REFRESH MATERIALIZED VIEW fails at scale — locking, cost, and the full-scan ceiling — and how switching to a stream table with DIFFERENTIAL mode fixes staleness in 5 lines of SQL. |
| Differential Dataflow for the Rest of Us | A plain-language walkthrough of the mathematics behind incremental view maintenance: delta rules for filters, joins, aggregates, the MERGE application step, and why some aggregates (MEDIAN, RANK) can't be made incremental. |
| Incremental Aggregates in PostgreSQL: No ETL Required | How SUM, COUNT, AVG, and (in v0.37) vector_avg are maintained as running algebraic state rather than full scans. Covers multi-table aggregates, conditional aggregates, and the non-differentiable cases. |
| The Z-Set: The Data Structure That Makes IVM Correct | A concrete tour of the integer-weighted multiset that underlies pg_trickle's differential engine — how inserts are +1, deletes are -1, updates are both, and why commutativity eliminates an entire class of ordering bugs. |
| The Cost Model: How pg_trickle Decides Whether to Refresh Differentially | Inside AUTO mode: the decision inputs (delta ratio, query complexity, historical timings), the learned cost model, and when the engine switches between DIFFERENTIAL and FULL refresh mid-flight. |
SQL Operator Deep Dives
| Post | Summary |
|---|---|
| Recursive CTEs That Update Themselves | Semi-naive evaluation for insert-only tables and Delete-and-Rederive for mixed DML — how pg_trickle maintains WITH RECURSIVE queries incrementally for org charts, BOMs, and graph reachability. |
| Window Functions Without the Full Recompute | Partition-scoped recomputation for ROW_NUMBER, RANK, LAG, LEAD, and all standard window functions. Change one partition, leave the rest untouched. |
| GROUPING SETS, ROLLUP, and CUBE — Incrementally | Multi-dimensional aggregation decomposed into UNION ALL branches, each maintained with algebraic delta rules. Drill-down dashboards that refresh in milliseconds. |
| EXISTS and NOT EXISTS: The Delta Rules Nobody Talks About | Semi-joins and anti-joins maintained via reference counting on the join key. Delta-key pre-filtering, inverted semantics for NOT EXISTS, SubLink extraction from WHERE clauses. |
| DISTINCT That Doesn't Recount | Reference counting (__pgt_dup_count) for incremental deduplication. Insert increments, delete decrements, row removed when count hits zero. DISTINCT ON with tie-breaking. |
| Scalar Subqueries in the SELECT List — Incrementally | Pre/post snapshot diff for correlated subqueries. Only groups affected by the delta are re-evaluated — O(affected groups), not O(all rows). |
| LATERAL Joins in a Stream Table | Row-scoped re-execution for JSON_TABLE, unnest(), generate_series(), and correlated set-returning functions. Cost proportional to changed left-side rows. |
| Set Operations Done Right: UNION, INTERSECT, EXCEPT | Dual-count multiplicity tracking for all set operations. UNION uses reference counting, INTERSECT requires both-side presence, EXCEPT removes when the right side gains a match. |
Refresh Modes & Scheduling
| Post | Summary |
|---|---|
| IMMEDIATE Mode: When "Good Enough Freshness" Isn't Good Enough | Synchronous IVM inside the source transaction — zero lag, no background worker. Account balances, inventory tracking, and the trade-offs vs. DIFFERENTIAL mode. |
| How pg_trickle Handles Diamond Dependencies | When two branches of a DAG share a source and converge downstream, naively refreshing can cause double-counting. How the frontier tracker and diamond-group scheduling ensure correctness. |
| Temporal Stream Tables: Time-Windowed Views That Update Themselves | The "last 7 days" problem — results that change because time passes, not because data changed. Sliding-window eviction, the temporal_mode parameter, and when fixed windows don't need it. |
| Declare Freshness Once: CALCULATED Scheduling | Upstream tables derive their refresh cadence from downstream consumers. Set the SLA on the dashboard; the pipeline adjusts automatically. |
| Cycles in Your Dependency Graph? That's Fine. | Fixed-point iteration for monotone queries. allow_circular = on, SCC detection, convergence guarantees, the iteration limit, and when cycles are a legitimate design choice. |
| Hot, Warm, Cold, Frozen: Tiered Scheduling at Scale | Automatic tier classification by change frequency. The scheduler checks hot tables every cycle, frozen tables every ~60 cycles — 80%+ overhead reduction at 500+ stream tables. |
CDC & Change Tracking
| Post | Summary |
|---|---|
| The CDC Mode You Never Have to Choose | Hybrid CDC starts with triggers, silently graduates to WAL. Three-step transition orchestration, automatic fallback on failure, WAL backpressure, and why AUTO is the right default. |
| IVM Without Primary Keys | Content-based hashing (xxHash64) generates synthetic row identity for keyless tables. Multiplicity counting for duplicates, collision probability, and when to add a PK anyway. |
| Foreign Tables as Stream Table Sources | IVM over postgres_fdw, file_fdw, and parquet_fdw sources using polling-based change detection. Mixed local/foreign source queries, performance trade-offs, and the materialize-first optimization. |
Architecture & Data Patterns
| Post | Summary |
|---|---|
| The Medallion Architecture Lives Inside PostgreSQL | Bronze/Silver/Gold without Spark or Airflow. Chained stream tables propagate from raw ingest to business aggregates in under 5 seconds, with DAG-aware scheduling and transactional consistency. |
| CQRS Without a Second Database | Command Query Responsibility Segregation using stream tables as the read model — same PostgreSQL instance, no CDC pipeline, read-your-writes with IMMEDIATE mode. |
| Slowly Changing Dimensions in Real Time | SCD Type 2 (historical attribute tracking with valid_from/valid_to) maintained continuously by a stream table — no nightly ETL, no Airflow DAG. |
| The Append-Only Fast Path | Why insert-only tables (event logs, sensor data, clickstreams) get a 2–3× faster refresh: no delete-side delta, no inverse computation, no before-image lookups. |
Use Cases & Migration
| Post | Summary |
|---|---|
| Real-Time Leaderboards That Don't Lie | Top-N stream tables for games, sales dashboards, and coding challenges — tied scores, multi-category boards, the pagination problem, and why you might not need Redis. |
| The Hidden Cost of Trigger-Based Denormalization | Four failure modes of hand-rolled trigger sync — blind UPDATE divergence, statement vs. row trigger semantics, invisible deletes, and multi-row races — and how declarative IVM avoids all of them. |
| How We Replaced a Celery Pipeline with 3 SQL Statements | A before/after case study of a Celery + Elasticsearch product search pipeline across three generations of growing complexity, and the pg_trickle stream table that replaced it. Includes benchmark numbers. |
| Migrating from pg_ivm to pg_trickle | Feature gap table, SQL syntax differences, step-by-step migration procedure, and when staying on pg_ivm is the right call. |
Integrations & Ecosystem
| Post | Summary |
|---|---|
| Streaming to Kafka Without Kafka Expertise | pgtrickle-relay bridges stream table deltas to Kafka, NATS, SQS, and webhooks — a single binary with TOML config, advisory-lock HA, subject routing, and Prometheus metrics. |
| The Relay Deep Dive: NATS, Redis Streams, and RabbitMQ | Beyond Kafka: per-backend architecture for NATS JetStream, Redis Streams, RabbitMQ, SQS, and HTTP webhooks. Subject templates, consumer groups, multi-sink pipelines, and a decision tree for choosing a backend. |
| The Inbox Pattern: Receiving Events from Kafka into PostgreSQL | Idempotent, ordered event ingestion via the inbox table — deduplication by event ID, dead-letter queue, and stream tables that aggregate incoming events incrementally. |
| The Outbox You Don't Have to Build | pg_trickle's built-in outbox API: enable_outbox(), consumer groups, poll_outbox(), offset tracking, exactly-once delivery, consumer lag monitoring, and cleanup. |
| dbt + pg_trickle: The Analytics Engineer's Stack | The pgtrickle dbt materialization: continuously-fresh models that are also version-controlled, tested, and documented. DAG alignment, freshness checks, and mixing materializations. |
| Distributed IVM with Citus | Incremental view maintenance across sharded PostgreSQL: per-worker CDC, shard-aware delta routing, co-located join push-down, and automatic recovery after shard rebalances. |
| pg_trickle on CloudNativePG | Production Kubernetes deployment using the CloudNativePG operator: Dockerfile, Cluster manifest, GUC configuration, HA failover behaviour, Prometheus metrics ConfigMap, alerting rules, upgrade procedure, and sizing guidance. |
| Making pg_trickle Work Through PgBouncer | Connection pooling modes, the background-worker bypass, LISTEN/NOTIFY caveats in transaction mode, and a configuration checklist for PgBouncer + pg_trickle. |
| Publishing Stream Tables via Logical Replication | Stream tables as standard publication sources for downstream PostgreSQL instances. Replication identity, multi-region distribution, and feeding Debezium/Kafka with clean aggregated events. |
| One PostgreSQL, Five Databases, One Worker Pool | Multi-database architecture: one launcher per server, one scheduler per database, shared worker pool with per-database quotas. Failure isolation and the database-per-tenant SaaS pattern. |
pgvector Integration
| Post | Summary |
|---|---|
| Your pgvector Index Is Lying to You | Four silent failure modes of unmanaged pgvector deployments: stale embedding corpora, drifting aggregates, IVFFlat recall loss, and over-fetching. How pg_trickle's differential IVM and drift-aware reindexing closes each gap. |
| Incremental Vector Aggregates: Building Recommendation Engines in Pure SQL | How vector_avg (v0.37) turns user taste vectors, category centroids, and cluster representatives into live algebraic aggregates — O(new interactions) cost, not O(history). Comparison with batch recomputation, feature stores, and application-level updates. |
| Deploying RAG at Scale: pg_trickle as Your Embedding Infrastructure | Production operations for pgvector + pg_trickle: drift-aware HNSW reindexing (reindex_if_drift), vector_status() monitoring, multi-tenant tiered indexing patterns, sparse/half-precision aggregates, reactive distance subscriptions, and the embedding_stream_table() ergonomic API. |
| HNSW Recall Is a Lie: Distribution Drift Explained | Deep dive on IVFFlat centroid staleness and HNSW tombstone accumulation — how to measure drift, what the right threshold is, and how post_refresh_action => 'reindex_if_drift' (v0.38) automates the fix. |
| The pgvector Tooling Landscape in 2026 | Honest comparison of pg_trickle against pgai (archived Feb 2026), pg_vectorize, DIY batch pipelines, and Debezium. Introduces the two-layer model: Layer 1 = embedding generation, Layer 2 = derived-state maintenance. |
| Multi-Tenant Vector Search with Row-Level Security | Zero cross-tenant data leakage using RLS policies on stream tables, tiered tenancy (large / medium / small tenant strategies), per-tenant partial HNSW indexes, and drift-aware reindexing per partition. |
Operations & Observability
| Post | Summary |
|---|---|
| Stop Rebuilding Your Search Index at 3am | How pg_trickle's scheduler, SLA tiers (critical / standard / background), backpressure, and parallel workers let you tune refresh behaviour per workload — and why the 3am maintenance window disappears with continuous incremental refresh. |
| pg_trickle Monitors Itself | Since v0.20, the extension's own health metrics are maintained as stream tables. How self-monitoring works, what it tracks, and the recursion question ("who monitors the monitor?"). |
| How to Change a Stream Table Query Without Taking It Offline | ALTER STREAM TABLE ... QUERY performs online schema evolution — the stream table stays queryable during migration, with atomic swap and cascade-safe dependency checking. |
| Backup and Restore for Stream Tables | pg_dump, PITR, selective restore, and the repair_stream_table procedure. What to do (and what breaks) when you restore a database with active stream tables. |
| Testing Stream Tables: Shadow Mode and Correctness Fuzzing | Shadow mode runs DIFFERENTIAL and FULL refresh in parallel and compares. SQLancer fuzzing generates random schemas and DML to find delta engine bugs. The multiset invariant and what it caught. |
| Snapshots: Time Travel for Stream Tables | snapshot_stream_table() captures point-in-time copies for pre-migration safety, replica bootstrap, forensic comparison, and test fixtures. Restore, list, and clean up with one function call each. |
| Drain Mode: Zero-Downtime Upgrades for Stream Tables | pgtrickle.drain() quiesces in-flight refreshes before maintenance. Safe upgrade workflow, CloudNativePG integration, HA failover, and the resume path. |
| Column-Level Lineage in One Function Call | stream_table_lineage() maps output columns to source columns. Impact analysis before ALTER TABLE, GDPR column-deletion audit, documentation generation, and recursive DAG tracing. |
| Error Budgets for Stream Tables | SRE-style freshness monitoring: sla_summary() with p50/p99 latency, staleness tracking, error budget consumption, alerting thresholds, and Prometheus integration. |
| Structured Logging and OpenTelemetry for Stream Tables | log_format = json emits structured events with cycle_id correlation. Event taxonomy, log aggregator integration (Loki, Datadog, Elasticsearch), and OpenTelemetry compatibility. |
Analytics & Feature Engineering
| Post | Summary |
|---|---|
| Funnel Analysis and Cohort Retention at Scale | Computing conversion funnels, retention matrices, and session aggregates incrementally — keeping product analytics live without billion-row scans. |
| Incremental ML Feature Engineering in PostgreSQL | Replace nightly feature store batch jobs with continuously fresh features: rolling windows, lag features, cross-entity comparisons, all maintained as stream tables. |
| Time-Series Downsampling Without TimescaleDB | Hourly, daily, and monthly rollups maintained incrementally from raw sensor data — cascading stream tables as a lightweight alternative to a dedicated TSDB. |
| Incremental Statistical Aggregates: stddev, Percentiles, and Histograms | Which higher-order statistics (variance, correlation, histograms) can be maintained exactly, which need approximations, and the space-accuracy trade-offs. |
Data Patterns & Domain Applications
| Post | Summary |
|---|---|
| Event Sourcing Read Models Without Replay | Project live read-optimized views from an append-only event store without replaying history — order status, revenue analytics, and inventory projections as stream tables. |
| Soft Deletes and Tombstone Management in Differential IVM | How deleted_at patterns interact with delta propagation, ghost row pitfalls, cascading visibility, and best practices for correct stream tables over soft-deletable data. |
| Compliance and Audit Trails with Append-Only Stream Tables | GDPR-compliant, tamper-evident audit logs: right-to-erasure reconciliation, hash chains, access pattern monitoring, and retention policies — all incrementally maintained. |
| Incremental Full-Text Search with tsvector | Maintain ranked search results incrementally as documents change — tracked queries, faceted counts, and top-K ranking without re-indexing the corpus. |
| Incremental PageRank and Graph Analytics in SQL | Live PageRank, connected components, and shortest-path metrics maintained inside PostgreSQL as stream tables — no graph database required. |
| PostGIS + pg_trickle: Incremental Geospatial Aggregates | Heatmaps, geofencing, spatial clustering, and distance-based aggregation that update in milliseconds as new points arrive. |
Deployment & Multi-Tenancy
| Post | Summary |
|---|---|
| High Availability Failover with pg_trickle and Patroni | How stream table state survives primary switchover, WAL replay semantics for change buffers, split-brain prevention, and zero-data-loss configuration. |
| Parameterized Stream Tables: Building a SQL View Library | Patterns for reusable, tenant-scoped, and versionable stream table definitions: single-table multi-tenant, template functions, schema isolation, and composable building blocks. |
Performance Internals
| Post | Summary |
|---|---|
| The 45ms Cold-Start Tax and How L0 Cache Eliminates It | Connection poolers recycle backends, paying a template-parse penalty. The L0 process-local RwLock<HashMap> cache keyed by (pgt_id, cache_generation) drops p99 from 48ms to 6ms. |
| Spill-to-Disk and the Auto-Fallback Safety Net | When delta queries exceed work_mem, pg_trickle detects consecutive spills and auto-switches to FULL refresh. Tuning merge_work_mem_mb, spill_threshold_blocks, and the self-healing recovery path. |
Benchmarks & Advanced Patterns
| Post | Summary |
|---|---|
| TPC-H at 1GB in 40ms | Reproducible benchmark of differential vs. full refresh across five TPC-H queries (Q1, Q3, Q5, Q6, Q12). Results: 13–22× faster per refresh cycle, with differential lag under 2.5 seconds vs. 186 seconds at 5,000 rows/second sustained write load. |
| From Nexmark to Production: Benchmarking Stream Processing in PostgreSQL | pg_trickle on the Nexmark streaming benchmark: per-query throughput, latency percentiles, and how the numbers compare to Flink, Materialize, and a cron job. |
| Reactive Alerts Without Polling | How pg_trickle's reactive subscriptions (v0.39) replace polling loops: SLA breach detection, inventory alerts, fraud velocity checks, and vector distance subscriptions. Covers OLD.*/NEW.* transition semantics and PostgreSQL LISTEN. |
| The Outbox Pattern, Turbocharged | Using stream tables as transactionally consistent event sources for the outbox pattern — derived aggregate events, fat payloads, transition-based routing, and why stream tables naturally debounce high-frequency changes into fewer events. |
Contributing
These posts are deliberately rough-edged — they're drafts exploring how the extension works, not polished marketing copy. If you spot a technical inaccuracy, open an issue or PR. If you want to write a post, open a discussion first to avoid duplication.
← Back to Blog Index | Documentation
The Append-Only Fast Path
Why event logs get special treatment in the differential engine
Most tables in a production database see INSERTs, UPDATEs, and DELETEs. The differential engine has to handle all three: compute the delta for inserted rows, compute the inverse delta for deleted rows, and handle updates as a delete-then-insert pair.
But some tables never see UPDATEs or DELETEs. Event logs, audit trails, IoT sensor data, clickstreams, financial journal entries — these are append-only by design. Once a row is written, it's never changed.
pg_trickle detects this pattern and takes a faster code path. No delete-side delta computation. No inverse operations. No before-image lookups. Just the forward delta from the new rows.
What the Fast Path Skips
In the general case, when pg_trickle processes a change buffer, it needs to:
1. Separate inserts from deletes. The change buffer contains rows tagged with +1 (insert) or -1 (delete). Updates appear as a -1 for the old value and +1 for the new value.
2. Compute the forward delta. For inserted rows: join with existing data, compute aggregate contributions, determine which groups are affected.
3. Compute the inverse delta. For deleted rows: look up the old aggregate state, subtract the deleted row's contribution, handle edge cases like a group becoming empty.
4. Merge the deltas. Combine the forward and inverse deltas into a single MERGE operation against the storage table.
For append-only sources, step 3 is unnecessary. There are no deleted rows. There are no updates. The change buffer contains only +1 rows.
This means the engine can skip:
- The delete-side delta computation
- The before-image lookup (checking the current aggregate state to compute the subtraction)
- The empty-group cleanup (removing groups that no longer have any contributing rows)
- The MERGE conflict handling for deleted groups
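To make the saving concrete, here's a hand-written sketch of what a forward-only refresh boils down to for a simple SUM/COUNT aggregate. This is illustrative SQL with invented table and column names, not the engine's generated plan; the general path would additionally need a delete-side branch and empty-group cleanup.
-- Forward-only delta for an append-only source (illustrative names, PG 15+ MERGE)
MERGE INTO st_storage AS s
USING (
    SELECT group_key, SUM(amount) AS delta_sum, COUNT(*) AS delta_count
    FROM change_buffer              -- contains only +1 rows for this source
    GROUP BY group_key
) AS d
ON s.group_key = d.group_key
WHEN MATCHED THEN
    UPDATE SET total = s.total + d.delta_sum,
               row_count = s.row_count + d.delta_count
WHEN NOT MATCHED THEN
    INSERT (group_key, total, row_count)
    VALUES (d.group_key, d.delta_sum, d.delta_count);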
How pg_trickle Detects Append-Only Sources
pg_trickle doesn't require you to declare a table as append-only. It detects it from the CDC trigger setup.
When you create a stream table, pg_trickle attaches AFTER INSERT, AFTER UPDATE, and AFTER DELETE triggers to each source table. The change buffer records what kind of operation produced each row.
If a source table has only ever produced INSERT operations in its change buffer, pg_trickle's scheduler notes this and enables the append-only fast path for that source. If an UPDATE or DELETE ever appears, the fast path is disabled for that source and the general delta path is used.
This is automatic. You don't configure it. You don't even need to know about it — it just makes things faster.
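There's no switch to flip, but if you're curious whether a given source still qualifies, you can look at its change buffer yourself. The buffer table name and column below are assumptions for illustration; the real buffers live under the pgtrickle_changes schema and their exact layout is an internal detail.
-- Illustrative only: has this source ever produced anything other than an insert?
-- (A -1 weight marks a delete, or the delete half of an update.)
SELECT EXISTS (
    SELECT 1
    FROM pgtrickle_changes.sensor_readings_buffer   -- hypothetical buffer table name
    WHERE weight < 0
) AS fast_path_disqualified;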
The Performance Difference
Numbers from the TPC-H benchmark suite, measuring refresh cycle time for the lineitem table (append-only workload — inserts only):
| Scenario | Rows/batch | General path | Append-only path | Speedup |
|---|---|---|---|---|
| Single-table SUM | 100 | 1.8ms | 0.9ms | 2.0× |
| Single-table SUM | 1,000 | 8.2ms | 3.1ms | 2.6× |
| JOIN + GROUP BY | 100 | 4.5ms | 2.1ms | 2.1× |
| JOIN + GROUP BY | 1,000 | 22ms | 8.5ms | 2.6× |
| Multi-table 3-way JOIN | 100 | 12ms | 5.8ms | 2.1× |
The speedup is roughly 2–3× for most queries. The savings come from skipping the delete-side computation entirely — no inverse lookups, no subtraction, no group cleanup.
For high-throughput event ingestion (thousands of rows per second), this adds up. A 2.5× speedup on the refresh cycle means you can handle 2.5× more events before the scheduler falls behind.
Event Log Example
An IoT sensor platform ingesting temperature readings:
CREATE TABLE sensor_readings (
id bigserial PRIMARY KEY,
sensor_id bigint NOT NULL,
temperature numeric(5,2) NOT NULL,
recorded_at timestamptz NOT NULL DEFAULT now()
);
-- Aggregate by sensor, hourly
SELECT pgtrickle.create_stream_table(
'sensor_hourly_avg',
$$SELECT
sensor_id,
date_trunc('hour', recorded_at) AS hour,
AVG(temperature) AS avg_temp,
MIN(temperature) AS min_temp,
MAX(temperature) AS max_temp,
COUNT(*) AS reading_count
FROM sensor_readings
GROUP BY sensor_id, date_trunc('hour', recorded_at)$$,
schedule => '1s',
refresh_mode => 'DIFFERENTIAL'
);
If sensor_readings is append-only (sensors only produce new readings, never update or delete old ones), the fast path kicks in automatically. Each refresh cycle processes only the new readings since the last cycle, using only the forward delta.
At 10,000 readings per second across 1,000 sensors and a 1-second schedule, each refresh cycle processes roughly 10,000 new rows. With the append-only fast path, each cycle takes about 3ms. Without it, about 8ms. Both are fast, but the margin matters when you're running at sustained high throughput.
Clickstream Analytics
Same pattern, different domain:
CREATE TABLE page_views (
id bigserial PRIMARY KEY,
user_id bigint,
page_url text NOT NULL,
referrer text,
device_type text,
viewed_at timestamptz NOT NULL DEFAULT now()
);
-- Real-time page popularity
SELECT pgtrickle.create_stream_table(
'page_popularity',
$$SELECT
page_url,
COUNT(*) AS view_count,
COUNT(DISTINCT user_id) AS unique_visitors,
MAX(viewed_at) AS last_view
FROM page_views
WHERE viewed_at >= now() - interval '24 hours'
GROUP BY page_url$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
page_views is append-only — you don't go back and edit page views. The fast path applies. The COUNT(DISTINCT user_id) aggregate is maintained incrementally with a HyperLogLog-style approximation for the distinct count (exact distinct counts require seeing the full group, but the COUNT DISTINCT fast path handles the common case correctly).
When the Fast Path Doesn't Apply
The fast path is disabled for a source table if:
- Any UPDATE or DELETE is ever captured. One UPDATE on sensor_readings and the fast path is disabled for that source. It won't re-enable automatically (the engine can't guarantee future operations will be insert-only).
- The query uses the source in a subquery with NOT EXISTS or EXCEPT. These operators need to check for the absence of rows, which requires the full delete-side delta.
- IMMEDIATE mode. In IMMEDIATE mode, the delta computation runs in the trigger itself, and the engine always uses the general path for simplicity. The fast path optimization is specific to DIFFERENTIAL mode's batch processing.
Designing for the Fast Path
If you're building a system with high-throughput event ingestion and you want maximum refresh performance:
- Use append-only tables for event data. Don't UPDATE rows in your event log. If an event needs correction, insert a new event (a compensating event) rather than modifying the original. One way to enforce this is sketched after this list.
- Separate mutable and immutable data. Keep your append-only events in one table and your mutable reference data (user profiles, product metadata) in another. The stream table can JOIN both — the fast path applies independently per source table.
- Use DIFFERENTIAL mode. The fast path only applies in DIFFERENTIAL mode, where the engine batches changes and can optimize the entire batch.
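The simplest way to guarantee the first point is to make updates and deletes impossible at the database level. This is plain PostgreSQL, not a pg_trickle feature, and the role name is a placeholder:
-- Application roles get INSERT and SELECT only; corrections become new rows.
REVOKE UPDATE, DELETE ON sensor_readings FROM app_writer;
GRANT INSERT, SELECT ON sensor_readings TO app_writer;
-- A correction arrives as a compensating reading, not an edit:
INSERT INTO sensor_readings (sensor_id, temperature, recorded_at)
VALUES (42, 21.50, now());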
The fast path is an optimization, not a feature you need to design around. If your tables happen to be append-only, pg_trickle rewards you with faster refresh cycles automatically.
← Back to Blog Index | Documentation
Backup and Restore for Stream Tables
pg_dump, PITR, and what happens when you restore a database with active stream tables
Stream tables are PostgreSQL tables. They have OIDs, they're in the catalog, they have indexes. pg_dump includes them.
But stream tables also have associated infrastructure: CDC triggers on source tables, change buffer tables in pgtrickle_changes, catalog entries in pgtrickle.pgt_stream_tables, and internal state (frontiers, refresh history, operator trees).
If you restore a pg_dump without understanding how these pieces interact, you can end up with stream tables that look correct but don't refresh, or CDC triggers that are missing, or change buffers that contain stale data.
This post explains what to do.
What pg_dump Captures
pg_dump captures:
| Component | Included in pg_dump? | Notes |
|---|---|---|
| Stream table (storage table) | ✅ | Regular table, fully dumped |
| Stream table data | ✅ | Snapshot at dump time |
| pgtrickle catalog tables | ✅ | pgt_stream_tables, pgt_dependencies, etc. |
| CDC triggers on source tables | ✅ | Part of the source table definition |
| Change buffer tables (pgtrickle_changes.*) | ✅ | But data may be stale or irrelevant after restore |
| Extension registration | ✅ | CREATE EXTENSION pg_trickle |
| Background worker state | ❌ | In-memory, not persisted |
| Shared memory state (frontiers) | ❌ | Rebuilt on startup |
The Simple Case: pg_dump + pg_restore
# Dump
pg_dump -Fc mydb > mydb.dump
# Restore to a new database
createdb mydb_restored
pg_restore -d mydb_restored mydb.dump
After restore:
- Stream table data is present — but it's a snapshot from dump time. Any changes to source tables after the dump are not reflected.
- CDC triggers are present — they'll start capturing changes as soon as the source tables are modified.
- Change buffers may contain stale data — rows from before the dump that were never processed.
- The scheduler starts automatically — if pg_trickle.enabled = on, the background worker picks up the stream tables and starts refreshing.
The Problem: Stale Change Buffers
The change buffers captured by pg_dump contain changes that were pending at dump time. After restore, the scheduler tries to process these changes — but the stream table already has the correct data (it was dumped with its data). Processing stale change buffer rows can cause double-counting.
The Fix: Repair After Restore
-- After restoring from pg_dump, repair all stream tables
SELECT pgtrickle.repair_stream_table(pgt_name)
FROM pgtrickle.pgt_stream_tables;
repair_stream_table does:
- Truncates the change buffer for the stream table's sources.
- Resets the frontier to the current state.
- Runs a full refresh to ensure the stream table matches the current source data.
- Re-registers CDC triggers if any are missing.
After repair, the stream table is consistent with the current source data and ready for incremental maintenance.
Point-in-Time Recovery (PITR)
PITR recovers the entire database to a specific point in time using WAL archives. Everything is consistent — source tables, stream tables, change buffers, catalog entries — because the recovery applies the WAL up to the target time.
# PostgreSQL 12+: set these in postgresql.conf and create an empty recovery.signal file; pre-12 versions use recovery.conf
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2026-04-27 10:00:00'
After PITR recovery:
- Everything is consistent at the recovery point. Stream tables reflect the state of source tables at that exact time.
- The scheduler resumes from the frontier at the recovery point. It won't try to process changes from after the recovery point (they don't exist).
- No repair needed — the WAL recovery handles all the state consistently.
PITR is the cleanest restore option for pg_trickle. If you have WAL archiving set up (and you should), this is the recommended recovery method.
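If archiving isn't set up yet, the baseline is standard PostgreSQL configuration, nothing pg_trickle-specific (shown here via ALTER SYSTEM; adjust the archive command to your storage):
ALTER SYSTEM SET archive_mode = 'on';                      -- requires a server restart
ALTER SYSTEM SET archive_command = 'cp %p /archive/%f';    -- a reload is enough for this one
SELECT pg_reload_conf();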
Selective Restore (Restoring Individual Tables)
Sometimes you need to restore a single table, not the entire database. This is trickier with stream tables.
Restoring a source table
If you restore a source table (e.g., orders) from a backup:
pg_restore -d mydb -t orders mydb.dump
The stream tables that depend on orders are now inconsistent — they reflect the old state of orders, but orders has been restored to a different state.
Fix:
-- Full refresh all stream tables that depend on 'orders'
SELECT pgtrickle.refresh_stream_table(st.pgt_name, force_mode => 'FULL')
FROM pgtrickle.pgt_stream_tables st
JOIN pgtrickle.pgt_dependencies d ON d.pgt_id = st.pgt_id
WHERE d.source_table = 'public.orders';
Restoring a stream table
If you restore a stream table itself:
pg_restore -d mydb -t orders_summary mydb.dump
The stream table now has data from the dump time, but the change buffer and frontier are from the current state. This is inconsistent.
Fix:
SELECT pgtrickle.repair_stream_table('orders_summary');
Snapshots
For cases where you need a consistent backup of a specific stream table (without dumping the entire database), pg_trickle has a snapshot feature:
-- Create a named snapshot
SELECT pgtrickle.snapshot_stream_table('orders_summary', 'before_migration');
-- List snapshots
SELECT * FROM pgtrickle.list_snapshots();
-- Restore from snapshot
SELECT pgtrickle.restore_from_snapshot('orders_summary', 'before_migration');
-- Clean up
SELECT pgtrickle.drop_snapshot('before_migration');
Snapshots are full copies of the stream table data stored in a separate table. They're useful for:
- Pre-migration backups (before altering the stream table's query)
- A/B testing (compare the current state with a known-good snapshot)
- Debugging (restore to a specific state for investigation)
CloudNativePG and Kubernetes
If you're running pg_trickle on CloudNativePG (the Kubernetes operator), backups use barman and WAL archiving to S3. The restore procedure is the same as standard PITR — the operator handles the WAL recovery, and pg_trickle's state is consistent at the recovery point.
One caveat: if you're restoring a replica that was promoted to primary, the stream table scheduler needs to be enabled on the new primary:
ALTER SYSTEM SET pg_trickle.enabled = on;
SELECT pg_reload_conf();
The scheduler only runs on the primary. Replicas don't run background workers.
Best Practices
- Use PITR over pg_dump when possible. PITR gives you a consistent point-in-time state. pg_dump requires post-restore repair.
- Always repair after pg_restore. Run repair_stream_table on every stream table after restoring from a logical dump.
- Don't exclude change buffer tables from dumps. The change buffers (pgtrickle_changes.*) should be included in the dump. Excluding them causes the scheduler to miss changes that were pending at dump time.
- Snapshot before migrations. Before running alter_stream_table with a query change, take a snapshot. If the migration goes wrong, you can restore the snapshot and try again.
- Test your restore procedure. Restore to a test database and verify that stream tables are refreshing correctly. Check pgtrickle.health_check() and pgtrickle.pgt_status() after restore (a quick smoke test is sketched below).
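A minimal post-restore smoke test, assuming the monitoring functions named above (their exact return columns may differ between releases):
-- Overall extension health, then per-stream-table status:
SELECT * FROM pgtrickle.health_check();
SELECT * FROM pgtrickle.pgt_status();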
← Back to Blog Index | Documentation
The Outbox You Don't Have to Build
pg_trickle's built-in outbox: consumer groups, offset tracking, exactly-once delivery
The transactional outbox pattern is well-known: write an event row in the same transaction as the business data, then have a poller read the event table and publish to a message broker. It guarantees that events are published if and only if the business transaction commits.
But building a correct outbox is tedious. You need a polling loop, offset tracking, consumer groups (if multiple consumers read the same events), and cleanup of consumed messages. Most teams spend a week building this and then spend a year maintaining it.
pg_trickle has a built-in outbox that takes one function call to enable. Stream table deltas — the rows added and removed by each refresh — are automatically captured as outbox messages. Consumer groups, offset tracking, and exactly-once delivery are provided out of the box.
This is distinct from the outbox pattern post, which describes using stream tables as the outbox. This post is about the outbox API itself — the machinery that makes it work.
Enabling the Outbox
-- Create a stream table
SELECT pgtrickle.create_stream_table(
name => 'order_summary',
query => $$ SELECT customer_id, COUNT(*), SUM(total) FROM orders GROUP BY customer_id $$,
schedule => '5s'
);
-- Enable outbox on it
SELECT pgtrickle.enable_outbox('order_summary');
That's it. From now on, every refresh of order_summary that produces changes (rows inserted, updated, or deleted) writes those changes as outbox messages.
What Gets Written
Each outbox message contains:
| Field | Description |
|---|---|
| outbox_id | Monotonically increasing sequence number |
| refresh_id | Which refresh cycle produced this message |
| op | Operation: I (insert), D (delete), U (update) |
| row_data | The row as JSONB |
| old_row_data | For updates: the previous row values |
| created_at | Timestamp of the refresh |
For a refresh that adds 3 customers and updates 1:
outbox_id | refresh_id | op | row_data | old_row_data
-----------+------------+----+-----------------------------------+------------------
1001 | 42 | I | {"customer_id":5,"count":1,...} | NULL
1002 | 42 | I | {"customer_id":6,"count":2,...} | NULL
1003 | 42 | I | {"customer_id":7,"count":1,...} | NULL
1004 | 42 | U | {"customer_id":3,"count":15,...} | {"customer_id":3,"count":14,...}
Consumer Groups
Multiple consumers can read from the same outbox independently. Each consumer group tracks its own offset.
-- Register a consumer group
SELECT pgtrickle.create_consumer_group('order_summary', 'analytics_pipeline');
SELECT pgtrickle.create_consumer_group('order_summary', 'search_indexer');
Each consumer group reads from the beginning and advances at its own pace. The analytics_pipeline can be 10 messages behind while the search_indexer is caught up. They don't interfere with each other.
Polling
-- Fetch next batch of messages
SELECT * FROM pgtrickle.poll_outbox('order_summary', 'search_indexer', batch_size => 100);
This returns up to 100 unread messages for the search_indexer consumer group, starting from its last committed offset.
The messages are returned in outbox_id order — guaranteed monotonic and gap-free within a single stream table.
Committing Offsets
After processing a batch, commit the offset:
SELECT pgtrickle.commit_offset('order_summary', 'search_indexer', 1004);
This marks all messages up to outbox_id = 1004 as consumed for the search_indexer group. The next poll_outbox() call will start from 1005.
Important: Offset commit is idempotent. Committing the same offset twice is a no-op. Committing a lower offset than the current one is also a no-op (offsets only move forward).
Exactly-Once Processing
Strictly speaking, the offset mechanism gives each consumer group at-least-once delivery: a message keeps being handed out until its offset is committed, so a crash between processing and committing causes re-delivery. To turn that into effectively exactly-once processing, you need idempotency on the consumer side. The standard approach:
# Consumer pseudocode (SQL calls shown as plain functions)
import time

while True:
    messages = poll_outbox('order_summary', 'search_indexer', batch_size=100)
    if not messages:
        time.sleep(1)        # nothing new; back off briefly
        continue
    for msg in messages:
        process(msg)         # process idempotently (e.g., upsert to a search index)
    # Commit only after the whole batch was processed successfully
    commit_offset('order_summary', 'search_indexer', messages[-1].outbox_id)
If the consumer crashes between processing and committing, the next poll re-delivers the uncommitted messages. The consumer must handle re-delivery gracefully (idempotent upserts, deduplication by outbox_id, etc.).
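One common idempotency recipe is to record processed message IDs in the target database alongside the side effects. The table below is your own, not part of pg_trickle; it simply turns re-delivery into a no-op:
CREATE TABLE IF NOT EXISTS consumed_messages (
    consumer_group text   NOT NULL,
    outbox_id      bigint NOT NULL,
    PRIMARY KEY (consumer_group, outbox_id)
);
-- Per message: claim the id first. Zero rows inserted means it was already handled,
-- so the consumer skips the side effect and moves on.
INSERT INTO consumed_messages (consumer_group, outbox_id)
VALUES ('search_indexer', 1001)
ON CONFLICT DO NOTHING;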
Consumer Lag
Monitor how far behind a consumer is:
SELECT * FROM pgtrickle.consumer_lag('order_summary');
consumer_group | committed_offset | latest_offset | lag | lag_seconds
--------------------+------------------+---------------+------+-------------
analytics_pipeline | 998 | 1004 | 6 | 12.4
search_indexer | 1004 | 1004 | 0 | 0.0
lag is the number of unconsumed messages. lag_seconds is the time between the consumer's committed offset and the latest message. If lag grows continuously, the consumer can't keep up.
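The same function works as an alerting source; the thresholds here are examples, not recommendations:
-- Page someone when a consumer group falls too far behind:
SELECT consumer_group, lag, lag_seconds
FROM pgtrickle.consumer_lag('order_summary')
WHERE lag > 1000 OR lag_seconds > 60;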
Outbox Status
SELECT * FROM pgtrickle.outbox_status('order_summary');
stream_table | enabled | total_messages | oldest_unconsumed | consumer_groups
----------------+---------+----------------+-------------------+-----------------
order_summary | t | 15,247 | 998 | 2
oldest_unconsumed is the lowest outbox_id that any consumer group hasn't committed yet. Messages below this point can be cleaned up.
Message Cleanup
Outbox messages accumulate. pg_trickle doesn't automatically delete them because it doesn't know when all consumers are done.
You can clean up consumed messages manually:
-- Delete messages that all consumer groups have committed past
SELECT pgtrickle.cleanup_outbox('order_summary');
This deletes all messages with outbox_id < min(committed_offsets across all consumer groups). It's safe — no consumer can ever re-read those messages.
For automated cleanup, schedule it:
-- pg_cron: clean up every hour
SELECT cron.schedule('outbox-cleanup', '0 * * * *', $$
SELECT pgtrickle.cleanup_outbox('order_summary');
$$);
Outbox + Relay
The outbox API is the foundation for pgtrickle-relay. The relay binary is a consumer that polls the outbox and publishes to external brokers (Kafka, NATS, SQS, etc.).
Under the hood, relay:
- Creates a consumer group: pgtrickle.create_consumer_group('order_summary', 'relay_kafka').
- Polls in a loop: pgtrickle.poll_outbox(...).
- Publishes to Kafka.
- Commits the offset after the Kafka ack.
You can use the outbox API directly for custom consumers, or use relay for standard broker integrations. They compose — one stream table can have both relay and custom consumers reading from the same outbox.
Disabling the Outbox
SELECT pgtrickle.disable_outbox('order_summary');
This stops writing new outbox messages. Existing messages remain in the table until cleaned up. Consumer groups remain registered (they'll see no new messages).
To fully clean up:
SELECT pgtrickle.disable_outbox('order_summary');
SELECT pgtrickle.cleanup_outbox('order_summary');
-- Drop consumer groups if no longer needed
When to Use the Built-In Outbox
Use the outbox when:
- You need to propagate stream table changes to external systems (search indexes, caches, notification services).
- Multiple consumers need independent reads of the same changes.
- You want offset tracking and replay without building it yourself.
Don't use the outbox when:
- You just need to query the stream table from SQL. The stream table itself is the API.
- You're using IMMEDIATE mode and need synchronous notification. Use PostgreSQL LISTEN/NOTIFY or reactive subscriptions instead.
- The stream table refreshes infrequently and changes are small. Direct polling of the stream table may be simpler.
Summary
pg_trickle's built-in outbox captures stream table deltas as consumable messages with consumer groups, offset tracking, and exactly-once delivery.
One function call to enable. One function call to poll. One function call to commit. Consumer lag monitoring, cleanup, and relay integration are all included.
If you're building an outbox from scratch on top of PostgreSQL, stop. The one you need already exists.
← Back to Blog Index | Documentation
Declare Freshness Once: CALCULATED Scheduling
How upstream tables derive their refresh cadence from downstream consumers
You have 15 stream tables. They form a DAG: raw tables → cleaned tables → aggregates → dashboards. The dashboard needs data within 10 seconds of reality. How fresh does each intermediate table need to be?
If you set every table to schedule => '2s', you're over-refreshing the tables nobody queries directly. If you set them to schedule => '30s', the dashboard falls behind. If you tune each one individually, you spend an afternoon doing arithmetic and then retune when the DAG changes.
pg_trickle's CALCULATED scheduling eliminates this problem. You declare the freshness requirement where it matters — on the consumer — and the system propagates it backward through the DAG.
The Idea
In CALCULATED mode, a stream table doesn't have a fixed schedule. Instead, its schedule is derived from the tightest (shortest interval) schedule among all stream tables that depend on it.
raw_events (CALCULATED → inherits 5s transitively)
↓
cleaned_events (consumed by summary → inherits 5s)
↓
event_summary (schedule => '5s')
↓
dashboard_metrics (schedule => '10s')
In this DAG:
- dashboard_metrics has a declared schedule of 10 seconds.
- event_summary has a declared schedule of 5 seconds.
- cleaned_events has no declared schedule. Its CALCULATED schedule is 5 seconds — inherited from event_summary, which is its tightest downstream consumer.
- raw_events has no declared schedule. Its CALCULATED schedule is also 5 seconds — inherited transitively.
If you later add a real-time alerting stream table that depends on cleaned_events with schedule => '1s', the CALCULATED schedule for cleaned_events and raw_events automatically tightens to 1 second. You don't change anything — the DAG propagation handles it.
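Concretely, that alerting consumer might look like this (the query and the severity column are invented for the example; the point is the 1-second declared schedule):
SELECT pgtrickle.create_stream_table(
    name     => 'realtime_alerts',
    query    => $$ SELECT * FROM cleaned_events WHERE severity = 'critical' $$,
    schedule => '1s'
);
-- cleaned_events and raw_events now inherit the 1s cadence; nothing else changes.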
Setting It Up
By default, a stream table uses a fixed schedule:
SELECT pgtrickle.create_stream_table(
name => 'event_summary',
query => $$ ... $$,
schedule => '5s'
);
To use CALCULATED scheduling on an intermediate table, omit the schedule or explicitly set it:
SELECT pgtrickle.create_stream_table(
name => 'cleaned_events',
query => $$ SELECT ... FROM raw_events WHERE valid = true $$,
schedule => 'CALCULATED'
);
When a stream table's schedule is CALCULATED, its effective refresh interval is the minimum of all downstream consumers' schedules. If no downstream consumer exists yet, it uses pg_trickle.default_schedule_seconds (default: 1 second).
How the Propagation Works
The scheduler maintains a dependency graph. When it evaluates the refresh schedule, it walks the DAG from leaves (tables with declared schedules) backward to roots:
- Collect all stream tables with declared (non-CALCULATED) schedules.
- For each CALCULATED stream table, find all downstream dependents.
- The effective schedule is min(downstream schedules).
- If the stream table has both a declared schedule and CALCULATED dependents, the declared schedule wins as a floor.
This computation happens once per scheduler cycle (every scheduler_interval_ms), not per refresh. The overhead is negligible — it's a graph traversal over a typically small DAG.
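If you want the intuition in SQL form, the propagation is roughly a recursive minimum over the dependency graph. The catalog names below are invented for illustration, and the sketch assumes an acyclic graph and ignores the declared-schedule floor; the real scheduler does this in memory:
WITH RECURSIVE demand AS (
    -- start from stream tables that declare an explicit interval
    SELECT pgt_id, declared_interval
    FROM stream_tables
    WHERE declared_interval IS NOT NULL
    UNION ALL
    -- push each requirement one hop upstream
    SELECT dep.upstream_id, demand.declared_interval
    FROM dependencies dep
    JOIN demand ON demand.pgt_id = dep.downstream_id
)
SELECT pgt_id, MIN(declared_interval) AS effective_interval
FROM demand
GROUP BY pgt_id;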
A Realistic Example
An e-commerce analytics pipeline:
-- Layer 1: Clean raw data (CALCULATED — derived from downstream)
SELECT pgtrickle.create_stream_table(
name => 'valid_orders',
query => $$
SELECT * FROM orders
WHERE status != 'cancelled' AND total > 0
$$,
schedule => 'CALCULATED'
);
-- Layer 2: Per-customer aggregates (CALCULATED)
SELECT pgtrickle.create_stream_table(
name => 'customer_metrics',
query => $$
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total) AS lifetime_value,
MAX(created_at) AS last_order
FROM valid_orders
GROUP BY customer_id
$$,
schedule => 'CALCULATED'
);
-- Layer 3a: Real-time dashboard (declared: 5s)
SELECT pgtrickle.create_stream_table(
name => 'dashboard_summary',
query => $$
SELECT
date_trunc('hour', last_order) AS hour,
COUNT(*) AS active_customers,
SUM(lifetime_value) AS total_ltv
FROM customer_metrics
GROUP BY 1
$$,
schedule => '5s'
);
-- Layer 3b: Weekly report (declared: 1h)
SELECT pgtrickle.create_stream_table(
name => 'weekly_report',
query => $$
SELECT
date_trunc('week', last_order) AS week,
COUNT(*) FILTER (WHERE order_count >= 5) AS repeat_customers
FROM customer_metrics
GROUP BY 1
$$,
schedule => '1h'
);
The effective schedules:
- valid_orders: CALCULATED → 5s (from dashboard_summary via customer_metrics)
- customer_metrics: CALCULATED → 5s (from dashboard_summary)
- dashboard_summary: 5s (declared)
- weekly_report: 1h (declared)
If you remove dashboard_summary, the CALCULATED tables relax to 1-hour cadence — because weekly_report is now the tightest consumer. No manual intervention needed.
The Default Schedule
When a CALCULATED stream table has no downstream consumers (it's a leaf that nobody else references), it falls back to pg_trickle.default_schedule_seconds. The default is 1 second.
This is intentional: CALCULATED tables are designed as intermediate nodes. If they're currently leaves, it's because the downstream consumer hasn't been created yet. Refreshing them at 1-second cadence ensures they're ready when the consumer arrives.
If you don't want this behavior — maybe it's a staging table you're building incrementally — set an explicit schedule instead:
SELECT pgtrickle.alter_stream_table(
name => 'staging_data',
schedule => '30s' -- explicit, not CALCULATED
);
CALCULATED + Diamond Dependencies
CALCULATED scheduling composes correctly with diamond dependencies. If two branches of the DAG converge at a common node:
A (5s) ──→ C (CALCULATED → 10s) ──→ E (10s)
B (2s) ──→ D (CALCULATED → 10s) ──→ E
Here:
- C and D are CALCULATED, and each is consumed only by E, which declares a 10-second schedule. CALCULATED propagates downstream demand backward, so both C and D get an effective schedule of 10 seconds.
- A and B have declared schedules (5 seconds and 2 seconds). Declared schedules are never derived, so the scheduler refreshes A every 5 seconds and B every 2 seconds regardless of what their consumers need.
The result: A and B refresh more often than C and D. The change buffers accumulate between C/D refreshes. When C and D refresh at 10s cadence, they process all accumulated changes in one batch. This is efficient — no wasted intermediate refreshes.
Monitoring Effective Schedules
You can see both declared and effective schedules:
SELECT
name,
schedule AS declared,
effective_schedule
FROM pgtrickle.pgt_status()
WHERE schedule = 'CALCULATED' OR schedule != effective_schedule;
name | declared | effective_schedule
-----------------+------------+--------------------
valid_orders | CALCULATED | 5s
customer_metrics| CALCULATED | 5s
If a CALCULATED table's effective schedule seems wrong, check the dependency tree:
SELECT * FROM pgtrickle.dependency_tree('valid_orders');
This shows the full DAG from the table to its consumers, making it easy to trace which consumer is driving the cadence.
When Not to Use CALCULATED
CALCULATED scheduling works well for intermediate pipeline tables. Don't use it for:
- Leaf tables that are queried directly. Give them an explicit schedule that matches your freshness requirement.
- Tables with side effects on refresh (e.g., stream tables feeding the outbox). These need a predictable, declared cadence.
- Debugging. When troubleshooting, explicit schedules are easier to reason about. Switch to CALCULATED after things stabilize.
Summary
CALCULATED scheduling inverts the refresh-planning problem. Instead of figuring out how fast each intermediate table needs to refresh, you declare freshness requirements on the tables you actually consume, and the DAG propagation derives everything else.
Add a new real-time consumer? The upstream tables speed up. Remove it? They slow down. Change the SLA? The whole pipeline adjusts.
One declaration, automatic propagation, zero manual tuning.
← Back to Blog Index | Documentation
Cycles in Your Dependency Graph? That's Fine.
Fixed-point iteration for monotone queries in pg_trickle
Dependency cycles are normally a fatal error in data pipelines. Airflow rejects them. dbt rejects them. Most IVM systems reject them. The logic is sound: if A depends on B and B depends on A, which one do you refresh first?
pg_trickle takes a different position. Some cycles are legitimate. A graph-reachability table that feeds a scoring table that feeds a threshold filter that feeds back into the reachability table — that's a cycle, but it has a well-defined fixed point. The system can iterate until nothing changes.
The key constraint: the queries in the cycle must be monotone. pg_trickle detects this at creation time and rejects non-monotone cycles. Monotone cycles converge. Non-monotone cycles can oscillate forever.
What Monotone Means
A query is monotone if adding rows to its input can only add rows to its output — never remove them. In practical terms:
Monotone operations:
- SELECT ... WHERE ... (filter) — adding a row that passes the filter adds it to the output
- JOIN (inner) — adding a row that matches adds join results. (Outer joins are not monotone: a new match replaces a previously NULL-padded row.)
- UNION ALL — adding to either side adds to the output
- GROUP BY ... HAVING ... with monotone aggregates (COUNT, SUM of non-negatives)
Non-monotone operations:
- EXCEPT — adding to the right side removes from the output
- NOT EXISTS — adding a matching row removes the anti-joined row
- Aggregates that can decrease (MIN, MAX with deletions)
- DISTINCT (adding a duplicate doesn't add to the output)
pg_trickle checks monotonicity when you create a stream table that would form a cycle. If any table in the cycle uses a non-monotone operator, the creation fails:
ERROR: stream table 'filtered_scores' would create a non-monotone cycle
DETAIL: NOT EXISTS in query is not monotone
HINT: Remove the circular dependency or use a non-cyclic design
Enabling Circular Dependencies
Cycles are disabled by default:
-- This fails with cycles disabled
SHOW pg_trickle.allow_circular;
-- off
SELECT pgtrickle.create_stream_table(
name => 'scores',
query => $$ SELECT ... FROM reachable $$,
schedule => '5s'
);
-- ERROR: would create circular dependency (reachable → scores → reachable)
To allow monotone cycles:
SET pg_trickle.allow_circular = on;
Or in postgresql.conf:
pg_trickle.allow_circular = on
How Fixed-Point Iteration Works
When the scheduler encounters a cycle (a strongly connected component, or SCC, in the dependency graph), it doesn't try to topologically sort it — that's impossible for a cycle. Instead, it iterates:
- Refresh all tables in the SCC once, in an arbitrary order.
- Check for net changes. If any table in the SCC produced new rows, go to step 1.
- If no table produced new rows, the SCC has converged. Move on.
This is fixed-point iteration. The monotonicity constraint guarantees it terminates: each iteration can only add rows (never remove them), and the total result is bounded (finite source tables, finite joins, finite query results). Eventually, no new rows are produced.
A Concrete Example
Consider a fraud detection pipeline where:
- suspicious_accounts flags accounts based on transaction patterns.
- risky_transactions flags transactions involving suspicious accounts.
- suspicious_accounts also considers accounts that have many risky transactions.
This is circular: suspicious accounts → risky transactions → suspicious accounts.
SET pg_trickle.allow_circular = on;
-- Table 1: Accounts flagged by direct pattern matching
SELECT pgtrickle.create_stream_table(
name => 'suspicious_accounts',
query => $$
SELECT account_id, 'pattern' AS reason
FROM transactions
GROUP BY account_id
HAVING COUNT(*) FILTER (WHERE amount > 10000) > 5
UNION ALL
-- Also flag accounts with many risky transactions
  SELECT account_id, 'association' AS reason
FROM risky_transactions
GROUP BY account_id
HAVING COUNT(*) > 10
$$,
schedule => '10s'
);
-- Table 2: Transactions involving suspicious accounts
SELECT pgtrickle.create_stream_table(
name => 'risky_transactions',
query => $$
SELECT t.*
FROM transactions t
JOIN suspicious_accounts sa ON t.account_id = sa.account_id
WHERE t.amount > 1000
$$,
schedule => '10s'
);
On the first iteration:
- suspicious_accounts finds accounts with >5 high-value transactions (pattern match).
- risky_transactions flags transactions by those accounts.
On the second iteration:
- suspicious_accounts now also sees the risky-transaction counts. Some new accounts cross the >10 threshold.
- risky_transactions picks up transactions from the newly flagged accounts.
On the third iteration (typically):
- No new accounts are flagged. No new transactions are flagged. Convergence.
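If you want to watch convergence by hand, the loop below is a rough stand-in for what the scheduler does. It's a sketch: it assumes refresh_stream_table() accepts just a name, and it uses total row count as the convergence test, which only works because monotone cycles never remove rows. The real engine compares per-refresh deltas instead.
DO $$
DECLARE
    prev_total bigint := -1;
    cur_total  bigint;
    iterations int    := 0;
BEGIN
    LOOP
        PERFORM pgtrickle.refresh_stream_table('suspicious_accounts');
        PERFORM pgtrickle.refresh_stream_table('risky_transactions');

        SELECT (SELECT count(*) FROM suspicious_accounts)
             + (SELECT count(*) FROM risky_transactions)
          INTO cur_total;

        EXIT WHEN cur_total = prev_total OR iterations >= 10;  -- converged, or hit the cap
        prev_total := cur_total;
        iterations := iterations + 1;
    END LOOP;
    RAISE NOTICE 'converged after % iterations', iterations;
END $$;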
The Iteration Limit
To prevent runaway iteration (in case of a bug or a degenerate data pattern), pg_trickle enforces a limit:
SHOW pg_trickle.max_fixpoint_iterations;
-- 10
If a cycle doesn't converge within 10 iterations, pg_trickle stops, logs a warning, and marks the SCC as "not converged." The tables are still queryable — they just contain the result after 10 iterations, which may not be the complete fixed point.
WARNING: SCC {suspicious_accounts, risky_transactions} did not converge
after 10 iterations (12 new rows in last iteration)
You can increase the limit if your domain requires deeper iteration:
SET pg_trickle.max_fixpoint_iterations = 50;
Monitoring SCCs
pg_trickle exposes SCC status through the monitoring API:
SELECT * FROM pgtrickle.scc_status();
scc_id | tables | last_iterations | converged | last_refresh
--------+--------------------------------------------+-----------------+-----------+-------------------
1 | {suspicious_accounts,risky_transactions} | 3 | t | 2026-04-27 10:15:03
Key fields:
- last_iterations: How many rounds it took to converge.
- converged: Whether the last refresh reached a fixed point.
- last_refresh: When the SCC was last fully resolved.
If converged is consistently false, your cycle may not be monotone in practice (even if the query structure passes the static check). Check whether data patterns cause oscillation.
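A quick way to surface that situation, using the columns shown above:
-- SCCs whose last pass stopped at the iteration cap without reaching a fixed point:
SELECT scc_id, tables, last_iterations, last_refresh
FROM pgtrickle.scc_status()
WHERE NOT converged;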
Why Most Systems Reject Cycles
The standard argument against cycles is that they're a design smell — if your data model has circular dependencies, you should restructure it. That's usually right.
But some domains genuinely have circular relationships:
- Graph algorithms: Reachability, PageRank, label propagation — all defined as fixed points.
- Constraint propagation: Scheduling constraints that reference each other.
- Multi-phase classification: Where the output of one classifier feeds into another, and vice versa.
- Supply chain: Demand forecasts that depend on inventory, which depends on demand forecasts.
For these cases, "restructure your schema" is unhelpful. The circularity is in the domain, not in a modeling mistake.
Cycles and IMMEDIATE Mode
IMMEDIATE mode (synchronous in-transaction refresh) and cycles don't mix. If A triggers a refresh of B, which triggers a refresh of A, you get an infinite loop inside a single transaction.
pg_trickle rejects this at creation time:
ERROR: IMMEDIATE mode is not allowed for stream tables in a cycle
HINT: Use DIFFERENTIAL or AUTO mode with a schedule
Cyclic stream tables must use scheduled (asynchronous) refresh. The fixed-point iteration runs in the background scheduler, not inside a user transaction.
Performance Considerations
Each iteration of a cycle is a full refresh cycle for all tables in the SCC. The total cost is:
cost = iterations × Σ(cost per table in SCC)
For most practical cycles, convergence happens in 2–4 iterations. But if your cycle has 10 tables that each take 200ms to refresh, each iteration costs about 2 seconds, so one convergence pass takes 4–8 seconds.
Optimization tips:
- Keep cycles small. Factor non-cyclic portions of the query out of the SCC.
- Use DIFFERENTIAL mode. Each iteration after the first typically processes only the new rows from the previous iteration.
- Set a tight max_fixpoint_iterations to fail fast if convergence is slow.
- Monitor scc_status() to track iteration counts. If they're consistently high, the cycle may need restructuring.
Summary
Circular dependencies aren't always a bug. pg_trickle allows them for monotone queries — queries where adding input rows can only add output rows. The system iterates until convergence (no new rows) or until the iteration limit.
Enable with pg_trickle.allow_circular = on. Monitor with scc_status(). Keep cycles small and monotone. And if convergence takes too many iterations, that's a signal to reconsider the design — not a reason to ban cycles entirely.
← Back to Blog Index | Documentation
Column-Level Lineage in One Function Call
Know exactly which source columns feed your dashboard metrics
"If I drop the discount_pct column from orders, which stream tables break?"
This question seems simple, but answering it requires tracing through every stream table's defining query, following joins and aggregates, and mapping output columns back to source columns. For a DAG with 30 stream tables, doing this manually is an afternoon of SQL parsing.
pg_trickle's stream_table_lineage() does it in one function call.
The API
SELECT * FROM pgtrickle.stream_table_lineage('revenue_by_region');
output_column | source_table | source_column | transform
---------------+--------------+---------------+-----------
region | customers | region | direct
revenue | orders | total | SUM
order_count | orders | id | COUNT
avg_order | orders | total | AVG
Each row maps one output column of the stream table to the source table and column it derives from, along with the transformation applied.
What Lineage Tells You
Impact Analysis
Before altering a source table, check what depends on it:
-- Which stream tables reference orders.discount_pct?
SELECT DISTINCT l.stream_table
FROM pgtrickle.pgt_stream_tables st
CROSS JOIN LATERAL pgtrickle.stream_table_lineage(st.name) l
WHERE l.source_table = 'orders' AND l.source_column = 'discount_pct';
stream_table
------------------
order_summary
revenue_by_region
discount_analysis
Now you know: dropping discount_pct affects three stream tables. You can plan the migration — update the stream table queries first, then drop the column.
GDPR Column Deletion
A user requests deletion of their data. You need to know everywhere their PII appears:
-- Which stream tables contain data derived from customers.email?
SELECT l.stream_table, l.output_column, l.transform
FROM pgtrickle.pgt_stream_tables st
CROSS JOIN LATERAL pgtrickle.stream_table_lineage(st.name) l
WHERE l.source_table = 'customers' AND l.source_column = 'email';
If email appears in a stream table's output (even through a join), the lineage trace finds it. You know exactly which stream tables need updating and which output columns contain PII.
Documentation Generation
Auto-generate a data dictionary from lineage:
-- All lineage for all stream tables
SELECT
st.name AS stream_table,
l.output_column,
l.source_table || '.' || l.source_column AS source,
l.transform
FROM pgtrickle.pgt_stream_tables st
CROSS JOIN LATERAL pgtrickle.stream_table_lineage(st.name) l
ORDER BY st.name, l.output_column;
Export this as CSV or pipe it into your documentation system. Every time a stream table changes, re-run the query to get the updated lineage.
Transformations
The transform column describes how the source column becomes the output column:
| Transform | Meaning |
|---|---|
| direct | Column passed through unchanged (possibly renamed) |
| SUM | Aggregated via SUM |
| COUNT | Aggregated via COUNT |
| AVG | Aggregated via AVG |
| MIN / MAX | Aggregated via MIN/MAX |
| expression | Used in a computed expression (e.g., total * quantity) |
| filter | Used in WHERE/HAVING (doesn't appear in output, but affects result) |
| join_key | Used as a join condition |
| window | Used in a window function |
For computed expressions, the lineage may show multiple source columns mapping to the same output column:
-- query: SELECT customer_id, total * quantity AS line_total FROM ...
SELECT * FROM pgtrickle.stream_table_lineage('line_items_expanded');
output_column | source_table | source_column | transform
---------------+--------------+---------------+------------
customer_id | orders | customer_id | direct
line_total | orders | total | expression
line_total | order_items | quantity | expression
Both total and quantity contribute to line_total.
Chained Lineage
When stream tables reference other stream tables (A → B → C), lineage by default shows only the immediate sources:
SELECT * FROM pgtrickle.stream_table_lineage('dashboard_summary');
output_column | source_table | source_column | transform
---------------+------------------+---------------+-----------
total_revenue | customer_metrics | lifetime_value| SUM
Here, customer_metrics is itself a stream table. To trace all the way back to base tables, call lineage recursively:
-- Transitive lineage: trace to base tables
WITH RECURSIVE full_lineage AS (
-- Start with the target stream table
SELECT
'dashboard_summary' AS stream_table,
output_column, source_table, source_column, transform
FROM pgtrickle.stream_table_lineage('dashboard_summary')
UNION ALL
-- Recurse through intermediate stream tables
SELECT
fl.source_table,
l.output_column, l.source_table, l.source_column, l.transform
FROM full_lineage fl
JOIN pgtrickle.pgt_stream_tables st ON st.name = fl.source_table
CROSS JOIN LATERAL pgtrickle.stream_table_lineage(st.name) l
WHERE l.output_column = fl.source_column
)
SELECT * FROM full_lineage
WHERE source_table NOT IN (SELECT name FROM pgtrickle.pgt_stream_tables);
This traces through the entire DAG and returns only the base-table lineage. For dashboard_summary → customer_metrics → orders, it returns the orders columns that ultimately feed the dashboard.
Lineage for Debugging
When a stream table's results look wrong, lineage helps narrow down where the problem is:
- Check lineage to identify which source columns feed the suspicious output column.
- Query the source tables directly to verify the data.
- If the source data is correct, the bug is in the transform (query logic).
- If the source data is wrong, the problem is upstream.
This is especially useful for deep DAGs where a bug in a base table ripples through multiple levels.
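For step 1, the check is a single filter over the lineage output — a small sketch reusing the revenue_by_region example from earlier:
-- Which source columns feed the suspicious output column?
SELECT source_table, source_column, transform
FROM pgtrickle.stream_table_lineage('revenue_by_region')
WHERE output_column = 'revenue';
From there, query the listed source tables directly and work through steps 2–4.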
Performance
stream_table_lineage() parses the stream table's defining query and traces column references through the OpTree (pg_trickle's internal query representation). It doesn't execute any queries against the actual data.
The cost is proportional to the complexity of the defining query — number of columns, joins, and subqueries. For a typical 10-column query with 3 joins, it returns in under 1ms.
For a recursive trace across the full DAG, the cost multiplies by the number of stream tables in the chain. A 5-level DAG with 20 stream tables typically completes in under 10ms.
Summary
stream_table_lineage() maps output columns to source columns in one function call. Use it for impact analysis before schema changes, GDPR compliance audits, documentation generation, and debugging.
For multi-level DAGs, compose it with a recursive CTE to trace all the way to base tables.
It's the kind of feature you don't think about until you're staring at 30 stream tables trying to figure out which one depends on the column you're about to drop. Then it's the most useful function in the extension.
← Back to Blog Index | Documentation
Compliance and Audit Trails with Append-Only Stream Tables
Building GDPR-compliant, tamper-evident audit logs as stream tables — including right-to-erasure reconciliation
Every regulated industry needs audit trails. Financial services, healthcare, government — they all require an immutable record of who did what, when, and why. The typical implementation is an audit_log table that receives INSERTs from application code or triggers. It grows forever, nobody queries it until an auditor shows up, and then someone discovers it's missing half the events because a developer forgot to add logging to a new endpoint.
pg_trickle offers a more robust approach. Instead of instrumenting every write path with explicit audit logging, you define audit views as stream tables that automatically derive the audit record from the operational data's change history. The change buffer tables that pg_trickle maintains for incremental view maintenance are themselves a complete, append-only record of every modification to the source tables. You're getting an audit trail as a side effect of the performance optimization.
The Change Buffer as Audit Source
When you register a table as a source for a stream table, pg_trickle installs CDC triggers that capture every INSERT, UPDATE, and DELETE into a change buffer table. Each change record includes:
- The full before-image (for UPDATEs and DELETEs)
- The full after-image (for INSERTs and UPDATEs)
- A timestamp
- The transaction ID
- The operation type (I/U/D)
This is exactly the information an audit trail needs. The difference from traditional audit logging is that it's comprehensive by construction — if a table is a stream table source, every change is captured, regardless of which application code path triggered it. No missed events, no forgotten instrumentation.
Deriving Audit Views
Instead of querying raw change buffers (which have an internal schema optimized for delta processing), you can define audit views as stream tables themselves:
-- Patient record modification audit trail (HIPAA requirement)
SELECT pgtrickle.create_stream_table(
'patient_audit_trail',
$$
SELECT
p.id AS patient_id,
p.name AS patient_name,
p.modified_by AS last_modified_by,
p.modified_at AS last_modification_time,
COUNT(*) AS total_modifications
FROM patients p
GROUP BY p.id, p.name, p.modified_by, p.modified_at
$$
);
For a richer audit trail that tracks every change (not just the latest state), you can combine the operational table with a separate audit events table:
-- Application writes audit events alongside data changes
CREATE TABLE audit_events (
id bigserial PRIMARY KEY,
table_name text NOT NULL,
record_id bigint NOT NULL,
action text NOT NULL, -- 'CREATE', 'UPDATE', 'DELETE'
actor text NOT NULL,
changed_fields jsonb,
old_values jsonb,
new_values jsonb,
reason text,
created_at timestamptz DEFAULT now()
);
-- Stream table: compliance summary per record
SELECT pgtrickle.create_stream_table(
'compliance_modification_summary',
$$
SELECT
table_name,
record_id,
COUNT(*) AS change_count,
COUNT(DISTINCT actor) AS distinct_actors,
MIN(created_at) AS first_change,
MAX(created_at) AS latest_change,
COUNT(*) FILTER (WHERE action = 'DELETE') AS delete_count
FROM audit_events
GROUP BY table_name, record_id
$$
);
This summary updates incrementally as audit events flow in. An auditor can instantly see which records have been modified most frequently, which records have been deleted, and which actors have been most active — without scanning the full audit log.
GDPR Right-to-Erasure Reconciliation
The tension between "immutable audit trail" and "right to be forgotten" is one of the hardest problems in compliance engineering. GDPR Article 17 gives individuals the right to have their personal data deleted. But your audit log says you must never delete records. How do you reconcile?
The standard approach is pseudonymization: replace identifying information with opaque tokens, preserving the audit trail's structure (who did what to which record) while removing the ability to identify the individual.
With pg_trickle, you can maintain a "pseudonymized audit view" that automatically reflects erasure operations:
-- Erasure requests table
CREATE TABLE erasure_requests (
id serial PRIMARY KEY,
subject_id bigint NOT NULL, -- the person requesting erasure
requested_at timestamptz DEFAULT now(),
completed_at timestamptz
);
-- Pseudonymized audit view: masks erased subjects
SELECT pgtrickle.create_stream_table(
'pseudonymized_audit',
$$
SELECT
a.id AS event_id,
a.table_name,
a.record_id,
a.action,
CASE
WHEN er.id IS NOT NULL THEN 'REDACTED-' || md5(a.actor)
ELSE a.actor
END AS actor,
a.created_at,
CASE
WHEN er.id IS NOT NULL THEN NULL
ELSE a.changed_fields
END AS changed_fields
FROM audit_events a
LEFT JOIN erasure_requests er
ON er.subject_id = (a.new_values->>'subject_id')::bigint
AND er.completed_at IS NOT NULL
$$
);
When an erasure request is completed (the completed_at field is set), the stream table automatically redacts all audit events associated with that subject. The audit trail still shows that actions occurred (preserving regulatory requirements for financial record-keeping), but personal identifiers are replaced with irreversible hashes.
The incremental nature means that processing an erasure request doesn't require scanning the entire audit log. pg_trickle identifies the audit events affected by the new erasure record (via the join) and updates only those rows in the pseudonymized view.
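Completing a request is then a single UPDATE on the operational table — a minimal sketch using the tables defined above (the subject id is a placeholder):
-- Mark the erasure request as completed; the next refresh of
-- pseudonymized_audit redacts only the audit events for this subject
UPDATE erasure_requests
SET completed_at = now()
WHERE subject_id = 12345
  AND completed_at IS NULL;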
Tamper Evidence
An audit trail is worthless if it can be silently modified. Deleted audit records, altered timestamps, and changed actor fields undermine the entire system. Traditional approaches to tamper evidence include:
- Hash chains (each record includes a hash of the previous record)
- Write-once storage (append-only file systems)
- External witnesses (send a hash to a timestamping service)
pg_trickle's change buffers provide a degree of tamper evidence by design. Because the CDC triggers capture every modification to source tables, any attempt to alter the audit events table itself would be captured as a change event. You'd need to disable the trigger, modify the data, and re-enable the trigger — which requires superuser access and leaves gaps in the sequence numbering.
For stronger tamper evidence, combine stream tables with a hash chain:
-- Maintain a running hash chain over audit events
SELECT pgtrickle.create_stream_table(
'audit_chain',
$$
SELECT
a.id AS event_id,
a.action,
a.actor,
a.created_at,
md5(
a.id::text ||
a.action ||
a.actor ||
a.created_at::text ||
COALESCE(
(SELECT md5_hash FROM audit_chain_prev WHERE event_id = a.id - 1),
'GENESIS'
)
) AS chain_hash
FROM audit_events a
$$
);
Any modification to a historical audit event would break the hash chain from that point forward — a tampering attempt becomes immediately detectable by verifying the chain.
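One way to run that verification — a minimal sketch, assuming the chain above is materialized as audit_chain(event_id, action, actor, created_at, chain_hash) and that each stored hash was built from the previous event's stored hash:
-- Recompute each link from the stored previous hash; any mismatch marks tampering
SELECT c.event_id
FROM audit_chain c
JOIN audit_events a ON a.id = c.event_id
LEFT JOIN audit_chain prev ON prev.event_id = c.event_id - 1
WHERE c.chain_hash <> md5(
    a.id::text || a.action || a.actor || a.created_at::text ||
    COALESCE(prev.chain_hash, 'GENESIS')
);
An empty result means the chain is intact from the genesis record forward.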
Access Pattern Monitoring
Beyond recording data changes, compliance often requires monitoring who accesses sensitive data and how often. Stream tables can maintain access pattern summaries:
-- Track data access patterns for compliance reporting
CREATE TABLE data_access_log (
id bigserial PRIMARY KEY,
accessor text NOT NULL,
resource text NOT NULL,
access_type text NOT NULL, -- 'READ', 'EXPORT', 'SHARE'
accessed_at timestamptz DEFAULT now()
);
SELECT pgtrickle.create_stream_table(
'access_pattern_summary',
$$
SELECT
accessor,
resource,
access_type,
date_trunc('day', accessed_at) AS day,
COUNT(*) AS access_count
FROM data_access_log
GROUP BY accessor, resource, access_type, date_trunc('day', accessed_at)
$$
);
-- Anomaly detection: who accessed more than usual?
SELECT pgtrickle.create_stream_table(
'access_anomalies',
$$
SELECT
accessor,
resource,
day,
access_count,
AVG(access_count) OVER (
PARTITION BY accessor, resource
ORDER BY day
ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS rolling_avg_30d
FROM access_pattern_summary
$$
);
The anomaly detection stream table compares today's access count against the 30-day rolling average. Spikes — an employee suddenly downloading 100x their normal volume of patient records — are immediately visible without running expensive ad-hoc queries.
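To turn the rolling average into an actual alert list, a follow-up query along these lines works against the stream table above (the 3× multiplier is an arbitrary example threshold):
-- Accessors whose daily volume is far above their own 30-day baseline
SELECT accessor, resource, day, access_count, rolling_avg_30d
FROM access_anomalies
WHERE rolling_avg_30d > 0
  AND access_count > 3 * rolling_avg_30d
ORDER BY access_count / rolling_avg_30d DESC;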
Retention Policies
Compliance regulations specify how long audit data must be retained. SOX requires 7 years for financial records. HIPAA requires 6 years. After the retention period, data should be purged — both for storage efficiency and to limit the blast radius of a breach.
With stream tables, you can implement retention-aware views:
SELECT pgtrickle.create_stream_table(
'active_audit_records',
$$
SELECT *
FROM audit_events
WHERE created_at > now() - interval '7 years'
$$
);
Records that age past the retention window automatically disappear from the stream table. The source audit events table can be partitioned by time, with old partitions archived to cold storage or dropped after the retention period.
The stream table's incremental maintenance handles the window expiry naturally: as records age out, they're subtracted from any aggregates that reference them. Your compliance dashboard always shows the correct counts within the retention window, without manual recalculation.
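A sketch of the partitioning side, using a hypothetical range-partitioned variant of the audit events table (names and the abbreviated column list are illustrative; note the primary key must include the partition key):
-- Hypothetical partitioned audit table
CREATE TABLE audit_events_p (
    id          bigserial,
    table_name  text        NOT NULL,
    record_id   bigint      NOT NULL,
    action      text        NOT NULL,
    actor       text        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
CREATE TABLE audit_events_p_2026_04
    PARTITION OF audit_events_p
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
-- Once a partition ages past the retention window, detach and archive (or drop) it
ALTER TABLE audit_events_p DETACH PARTITION audit_events_p_2026_04;
The rows being archived are already outside the retention-aware view's filter window, so archiving old partitions shouldn't change the stream table's contents.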
Putting It Together
A complete compliance architecture with pg_trickle:
- Source tables with CDC triggers (automatic with stream tables) — captures every data modification
- Audit events table — explicit audit log for application-level actions (login, export, share)
- Pseudonymized audit view (stream table) — GDPR-safe view with automatic redaction on erasure
- Access pattern summary (stream table) — incremental aggregation of who accessed what
- Compliance dashboard (stream table) — high-level metrics (total changes, distinct actors, anomalies)
Each layer is incrementally maintained. The cost of compliance monitoring scales with the rate of changes, not the volume of historical data. An auditor querying the compliance dashboard reads from pre-computed stream tables — no full scans, no waiting, no stale reports.
Compliance doesn't have to be expensive. When your audit trail maintains itself incrementally, you get tamper evidence, GDPR compatibility, and instant compliance reports — all as side effects of how pg_trickle already works.
← Back to Blog Index | Documentation
The Cost Model: How pg_trickle Decides Whether to Refresh Differentially
Inside the AUTO mode decision engine
pg_trickle supports three refresh modes: FULL (recompute everything), DIFFERENTIAL (apply only the delta), and IMMEDIATE (apply the delta in the source transaction).
There's a fourth option: AUTO. With AUTO mode, pg_trickle decides on each refresh cycle whether to use DIFFERENTIAL or FULL, based on a learned cost model.
Why Not Always Use DIFFERENTIAL?
DIFFERENTIAL refresh is faster when the delta is small relative to the total data. If 10 rows changed out of 10 million, DIFFERENTIAL processes 10 rows. FULL scans 10 million.
But DIFFERENTIAL has overhead that FULL doesn't:
- Change buffer management. Reading from the change buffer, deduplicating, ordering.
- Delta computation. Running the delta rules through the operator tree.
- Merge complexity. The MERGE statement for DIFFERENTIAL is more complex than a simple INSERT.
- State maintenance. The engine maintains auxiliary data structures (group state for aggregates, join indexes for delta joins).
When the delta is large — say, 60% of the table was updated in a bulk operation — the DIFFERENTIAL overhead can exceed the cost of just recomputing from scratch. At that point, FULL refresh is faster.
The crossover point depends on the query, the data distribution, the table size, and the hardware. There's no universal threshold.
The AUTO Mode Cost Model
When you set refresh_mode => 'AUTO', pg_trickle evaluates a cost estimate at the beginning of each refresh cycle:
SELECT pgtrickle.create_stream_table(
'orders_summary',
$$SELECT region, SUM(amount), COUNT(*)
FROM orders GROUP BY region$$,
schedule => '5s',
refresh_mode => 'AUTO'
);
The Decision Inputs
- Delta size. The number of rows in the change buffer since the last refresh.
- Source table size. The estimated row count of each source table (from pg_class.reltuples).
- Delta ratio. delta_size / source_table_size — what fraction of the source changed.
- Query complexity. Number of JOINs, aggregation groups, subqueries.
- Historical refresh times. How long FULL and DIFFERENTIAL refreshes have actually taken for this stream table (learned from pgt_refresh_history).
The Decision
The cost model estimates:
estimated_diff_cost = f(delta_size, query_complexity, join_count) + overhead
estimated_full_cost = g(source_table_size, query_complexity)
If estimated_diff_cost × safety_margin < estimated_full_cost, use DIFFERENTIAL. Otherwise, use FULL.
The safety_margin (default: 0.8) discounts the DIFFERENTIAL estimate, biasing the decision toward DIFFERENTIAL — it's preferred unless FULL is clearly cheaper. This is because DIFFERENTIAL has lower I/O impact (it doesn't scan the entire source table) and doesn't hold locks as long.
The Learning Component
The cost model starts with conservative estimates based on table statistics. After each refresh, it records the actual cost (wall-clock time, rows processed, I/O). Over time, the model learns the actual FULL and DIFFERENTIAL costs for each stream table and refines its estimates.
You can see the model's current state:
SELECT * FROM pgtrickle.explain_refresh_mode('orders_summary');
This returns:
stream_table | current_mode | last_full_ms | last_diff_ms | delta_ratio | threshold | recommendation
----------------+--------------+--------------+--------------+-------------+-----------+---------------
orders_summary | AUTO(DIFF) | 450.2 | 3.1 | 0.0001 | 0.35 | DIFFERENTIAL
threshold is the delta ratio above which the model switches to FULL. In this case, if more than 35% of the source table changes in a single cycle, the model will choose FULL refresh.
When AUTO Switches to FULL
Bulk loads
-- A bulk import inserts 500,000 rows into a 1,000,000-row table
COPY orders FROM '/data/import.csv';
The change buffer now has 500,000 rows. Delta ratio: 50%. The cost model says: this is more than the threshold (35%). FULL refresh will be faster.
The scheduler runs a FULL refresh for this cycle, then switches back to DIFFERENTIAL for subsequent cycles (which presumably have normal-sized deltas).
Initial population
When a stream table is first created, the initial population is always a FULL refresh — there's no delta to apply. After the first cycle, AUTO mode begins evaluating.
Periodic catch-up
If the scheduler falls behind (e.g., the database was under heavy load and skipped several cycles), the accumulated change buffer might exceed the threshold. AUTO mode will catch up with a FULL refresh rather than processing a massive delta incrementally.
Overriding AUTO
You can always override the model's decision with a manual refresh:
-- Force a FULL refresh regardless of the cost model
SELECT pgtrickle.refresh_stream_table('orders_summary', force_mode => 'FULL');
-- Force a DIFFERENTIAL refresh
SELECT pgtrickle.refresh_stream_table('orders_summary', force_mode => 'DIFFERENTIAL');
And you can adjust the threshold:
-- More aggressive DIFFERENTIAL (switch to FULL only above 50% delta ratio)
SELECT pgtrickle.alter_stream_table('orders_summary', auto_threshold => 0.5);
-- More aggressive FULL (switch to FULL above 10% delta ratio)
SELECT pgtrickle.alter_stream_table('orders_summary', auto_threshold => 0.1);
Monitoring AUTO Decisions
The refresh history records which mode was chosen for each cycle:
SELECT
refreshed_at,
refresh_mode,
duration_ms,
rows_affected,
delta_size
FROM pgtrickle.get_refresh_history('orders_summary')
ORDER BY refreshed_at DESC
LIMIT 20;
refreshed_at | refresh_mode | duration_ms | rows_affected | delta_size
-------------------------+--------------+-------------+---------------+-----------
2026-04-27 10:15:03.412 | DIFFERENTIAL | 3.1 | 12 | 45
2026-04-27 10:14:58.301 | DIFFERENTIAL | 2.8 | 8 | 30
2026-04-27 10:14:53.198 | FULL | 452.1 | 10200 | 505000
2026-04-27 10:14:48.100 | DIFFERENTIAL | 3.5 | 15 | 52
You can see the bulk import at 10:14:53 triggered a FULL refresh (delta_size = 505,000). The subsequent cycles returned to DIFFERENTIAL.
The Recommendation Engine
If you're not sure which mode to use, pg_trickle can recommend:
SELECT * FROM pgtrickle.recommend_refresh_mode('orders_summary');
This analyzes the query structure, current table sizes, and historical refresh patterns to suggest DIFFERENTIAL, FULL, IMMEDIATE, or AUTO.
It also tells you the refresh efficiency — how much work DIFFERENTIAL saves compared to FULL:
SELECT * FROM pgtrickle.refresh_efficiency('orders_summary');
stream_table | avg_diff_ms | avg_full_ms | efficiency_ratio | recommendation
----------------+-------------+-------------+------------------+---------------
orders_summary | 3.2 | 450.0 | 140.6x | DIFFERENTIAL
An efficiency ratio of 140× means DIFFERENTIAL is 140 times faster than FULL on average. For this stream table, there's no reason to use FULL except during bulk loads — which AUTO handles automatically.
When to Use Each Mode
| Mode | Best for |
|---|---|
| DIFFERENTIAL | Most workloads: continuous changes, small deltas |
| FULL | Bulk loads, complex non-differentiable queries |
| IMMEDIATE | Read-your-writes consistency, low write throughput |
| AUTO | Mixed workloads: normal traffic + periodic bulk imports |
If you're unsure, start with AUTO. It'll do the right thing for most workloads and you can always switch to a fixed mode if you want more predictability.
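If you want to flip an existing stream table over without recreating it, something along these lines should work — note that passing refresh_mode to alter_stream_table is an assumption here; the examples above only show auto_threshold and post_refresh_action as alter parameters:
-- Switch an existing stream table to AUTO and set the FULL-switch threshold
SELECT pgtrickle.alter_stream_table(
    'orders_summary',
    refresh_mode   => 'AUTO',   -- assumed parameter, shown for illustration
    auto_threshold => 0.35      -- delta ratio above which FULL is chosen
);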
← Back to Blog Index | Documentation
CQRS Without a Second Database
Command Query Responsibility Segregation using stream tables instead of a separate read store
CQRS is a clean idea with an ugly implementation.
The pattern says: separate your write model (normalized, optimized for transactions) from your read model (denormalized, optimized for queries). Writes go to one place, reads from another.
The standard implementation: writes go to PostgreSQL, a CDC pipeline (Debezium, usually) reads the WAL, transforms the events, and writes them to a separate read store (Elasticsearch, DynamoDB, a read replica with different indexes, sometimes a second PostgreSQL instance with materialized views).
Now you're maintaining two databases, a CDC pipeline, a schema registry, retry logic, monitoring for the pipeline, and a runbook for when the pipeline breaks and your read model is 20 minutes behind.
pg_trickle collapses this into one PostgreSQL instance.
The Architecture
Traditional CQRS:
Application
│ writes ─→ PostgreSQL (write model)
│ │
│ ▼
│ Debezium CDC ──→ Kafka ──→ Consumer ──→ Elasticsearch (read model)
│ │
└── reads ←────────────────────────────────────────────────┘
pg_trickle CQRS:
Application
│ writes ─→ PostgreSQL
│ │
│ CDC triggers ──→ stream tables (read model)
│ │
└── reads ←──────────────────────────┘
The write model is your normalized tables. The read model is stream tables — denormalized projections that pg_trickle maintains automatically.
A Concrete Example: Order Management
The write model is a standard normalized schema:
CREATE TABLE customers (
id bigint PRIMARY KEY,
name text NOT NULL,
email text NOT NULL,
tier text NOT NULL DEFAULT 'standard'
);
CREATE TABLE orders (
id bigserial PRIMARY KEY,
customer_id bigint NOT NULL REFERENCES customers(id),
status text NOT NULL DEFAULT 'pending',
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE order_items (
id bigserial PRIMARY KEY,
order_id bigint NOT NULL REFERENCES orders(id),
product_id bigint NOT NULL,
quantity int NOT NULL,
unit_price numeric(10,2) NOT NULL
);
CREATE TABLE products (
id bigint PRIMARY KEY,
name text NOT NULL,
category text NOT NULL
);
The application writes to these tables with normal INSERT/UPDATE/DELETE statements. The schema is normalized — no redundancy, no denormalization trade-offs.
The Read Model
The API needs a flat "order detail" view that includes everything in one query: order info, customer name, line items with product names, and totals.
Without pg_trickle, this is either a complex JOIN query on every API request (slow at scale), or a denormalized table maintained by application code (error-prone and stale).
With pg_trickle:
-- Order detail view: everything the API needs in one row per order
SELECT pgtrickle.create_stream_table(
'order_detail_view',
$$SELECT
o.id AS order_id,
o.status,
o.created_at,
c.name AS customer_name,
c.email AS customer_email,
c.tier AS customer_tier,
SUM(oi.quantity * oi.unit_price) AS order_total,
COUNT(oi.id) AS line_item_count,
array_agg(DISTINCT p.category) AS categories
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
GROUP BY o.id, o.status, o.created_at,
c.name, c.email, c.tier$$,
refresh_mode => 'IMMEDIATE'
);
With IMMEDIATE mode, the stream table is updated in the same transaction as the write. The API reads from order_detail_view and gets a pre-joined, pre-aggregated result with zero lag.
Read-Your-Writes
The critical property for CQRS in a web application is read-your-writes consistency. When a user places an order, the confirmation page should show that order immediately.
With a traditional CDC-based read model, there's a lag: the order is written, the CDC pipeline picks it up, transforms it, writes it to the read store. The confirmation page might not see the order for a few seconds.
With IMMEDIATE mode, the stream table is updated within the same PostgreSQL transaction. The API can read from the stream table in the same request that wrote the order:
BEGIN;
-- Write the order
INSERT INTO orders (customer_id) VALUES (42) RETURNING id;
-- order_id = 1001
-- Write line items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (1001, 7, 2, 29.99), (1001, 12, 1, 49.99);
-- Read the fully-denormalized view — already reflects the new order
SELECT * FROM order_detail_view WHERE order_id = 1001;
COMMIT;
No polling. No retry. No "wait 2 seconds and then refresh the page."
Multiple Read Models
Different API consumers need different projections. The customer portal needs order history. The warehouse needs picking lists. The finance team needs revenue summaries.
-- Customer's order history (for the customer portal)
SELECT pgtrickle.create_stream_table(
'customer_order_history',
$$SELECT
c.id AS customer_id,
c.name,
COUNT(o.id) AS total_orders,
SUM(oi.quantity * oi.unit_price) AS lifetime_spend,
MAX(o.created_at) AS last_order_at
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
LEFT JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.id, c.name$$,
refresh_mode => 'IMMEDIATE'
);
-- Picking list (for the warehouse)
SELECT pgtrickle.create_stream_table(
'warehouse_picking_list',
$$SELECT
o.id AS order_id,
p.name AS product_name,
p.category,
oi.quantity,
o.created_at AS ordered_at
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.status = 'confirmed'$$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL'
);
-- Revenue summary (for finance dashboards)
SELECT pgtrickle.create_stream_table(
'revenue_summary',
$$SELECT
date_trunc('day', o.created_at) AS day,
p.category,
SUM(oi.quantity * oi.unit_price) AS revenue,
COUNT(DISTINCT o.id) AS order_count
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.status IN ('confirmed', 'shipped', 'delivered')
GROUP BY date_trunc('day', o.created_at), p.category$$,
schedule => '5s',
refresh_mode => 'DIFFERENTIAL'
);
Notice the mixed modes: the customer portal uses IMMEDIATE (read-your-writes matters), the warehouse uses DIFFERENTIAL with a 2-second schedule (a small delay is fine), the finance dashboard uses a 5-second schedule (freshness is nice but not critical).
Each read model is independently maintained. Adding a new one doesn't affect the others.
What You Don't Need Anymore
With stream tables as your read model:
- No CDC pipeline. No Debezium, no Kafka Connect, no custom consumers. Change capture is built into pg_trickle's triggers.
- No separate read database. The read model lives in the same PostgreSQL instance. Same backup, same monitoring, same connection string.
- No schema synchronization. When the write model schema changes, you update the stream table's query with alter_stream_table. No schema registry, no Avro evolution rules.
- No consistency reconciliation. The read model is maintained transactionally. There's no need for periodic reconciliation jobs to fix drift between the write and read models.
- No read replica lag monitoring. Stream tables don't use PostgreSQL replication. They're regular tables maintained by the same instance.
When the Single-Database Approach Breaks Down
pg_trickle's CQRS works inside a single PostgreSQL instance (or Citus cluster). It breaks down when:
- Read and write workloads need separate scaling. If your reads need 10× the compute of your writes, you might want a separate read cluster. pg_trickle can still help here: use the outbox + relay to replicate stream table deltas to a read-only PostgreSQL replica.
- The read model needs a different query engine. If your read model is a full-text search index that needs inverted indexes with BM25 scoring, Elasticsearch is the right tool. (Though PostgreSQL's full-text search with GIN indexes is surprisingly capable.)
- Terabyte-scale analytics. If your read model is a columnar analytics store processing terabytes, a dedicated OLAP system (ClickHouse, DuckDB) is more appropriate.
For the vast majority of CQRS use cases — operational read models, denormalized API views, real-time dashboards — a single PostgreSQL instance with stream tables is simpler, cheaper, and more correct than the traditional multi-system architecture.
← Back to Blog Index | Documentation
dbt + pg_trickle: The Analytics Engineer's Stack
Using dbt to manage stream tables that actually stay fresh
dbt transformed how analytics teams write SQL. You define models, dbt handles dependencies, testing, and documentation. You run dbt run, your warehouse updates.
But dbt has a freshness problem. dbt run is a batch operation. Between runs, your models are stale. Some teams run dbt every hour. Aggressive teams run it every 15 minutes. Very few run it more often than that, because the full model graph takes time to rebuild.
pg_trickle solves the freshness side. dbt solves the governance side. Together they give you continuously-fresh models that are also version-controlled, tested, and documented.
What dbt-pgtrickle Does
dbt-pgtrickle is a dbt package that adds a pgtrickle materialization strategy. Instead of creating a table or view, it creates a stream table:
-- models/order_totals.sql
{{
config(
materialized='pgtrickle',
schedule='5s',
refresh_mode='DIFFERENTIAL'
)
}}
SELECT
customer_id,
SUM(amount) AS total_spend,
COUNT(*) AS order_count
FROM {{ ref('orders') }}
GROUP BY customer_id
When you run dbt run, dbt-pgtrickle calls pgtrickle.create_stream_table() (or create_or_replace_stream_table() if it already exists). The stream table is then maintained by pg_trickle's background worker — dbt doesn't need to refresh it.
The Workflow
Initial Setup
# Add to packages.yml
packages:
- package: grove/dbt_pgtrickle
version: ">=0.36.0"
# Install
dbt deps
Defining Models
dbt models with the pgtrickle materialization look like regular SQL models, plus a few config parameters:
# models/schema.yml
models:
- name: revenue_by_region
config:
materialized: pgtrickle
schedule: '3s'
refresh_mode: DIFFERENTIAL
columns:
- name: region
tests:
- not_null
- name: revenue
tests:
- not_null
-- models/revenue_by_region.sql
{{
config(
materialized='pgtrickle',
schedule='3s',
refresh_mode='DIFFERENTIAL'
)
}}
SELECT
c.region,
date_trunc('day', o.created_at) AS order_date,
SUM(o.amount) AS revenue,
COUNT(*) AS order_count
FROM {{ source('app', 'orders') }} o
JOIN {{ source('app', 'customers') }} c ON c.id = o.customer_id
GROUP BY c.region, date_trunc('day', o.created_at)
Running
# First run: creates all stream tables
dbt run
# Subsequent runs: recreates only models whose SQL changed
dbt run
# Full refresh: drops and recreates all stream tables
dbt run --full-refresh
dbt run is idempotent. If the model SQL hasn't changed, create_or_replace_stream_table detects the unchanged query and skips recreation. If the SQL has changed, it performs an online query migration — the stream table stays queryable during the rebuild.
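Under the hood, a run for a changed model boils down to a call like this — a sketch; the exact arguments dbt-pgtrickle passes are an assumption, modeled on the create_stream_table signature used elsewhere in this post:
-- What dbt-pgtrickle roughly issues for a changed model
SELECT pgtrickle.create_or_replace_stream_table(
    'revenue_by_region',
    $$
    SELECT c.region,
           date_trunc('day', o.created_at) AS order_date,
           SUM(o.amount) AS revenue,
           COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region, date_trunc('day', o.created_at)
    $$,
    schedule => '3s',
    refresh_mode => 'DIFFERENTIAL'
);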
Testing
dbt tests work normally on stream tables. They're regular PostgreSQL tables, so dbt test runs SELECT queries against them:
# Run all tests
dbt test
# Test a specific model
dbt test --select revenue_by_region
One thing to be aware of: since stream tables update asynchronously (for DIFFERENTIAL mode), there's a small window where a test might fail because the latest source data hasn't propagated yet. The dbt-pgtrickle package adds a pgtrickle_wait_fresh test helper that blocks until the stream table's frontier is current:
-- tests/revenue_by_region_is_fresh.sql
SELECT * FROM pgtrickle.get_staleness('revenue_by_region')
WHERE staleness > interval '10 seconds'
Documentation
dbt docs generate works as expected. Stream tables appear in the documentation graph with their dependencies. The dbt-pgtrickle package adds custom metadata to the docs: schedule, refresh mode, average refresh latency, and last refresh timestamp.
DAG Integration
dbt manages the dependency graph between models. pg_trickle manages the refresh dependency graph between stream tables. These two DAGs align automatically when you use {{ ref() }}:
-- models/silver_orders.sql (stream table)
{{ config(materialized='pgtrickle', schedule='2s', refresh_mode='DIFFERENTIAL') }}
SELECT o.*, c.region, c.tier
FROM {{ source('app', 'orders') }} o
JOIN {{ source('app', 'customers') }} c ON c.id = o.customer_id
-- models/gold_revenue.sql (stream table, depends on silver)
{{ config(materialized='pgtrickle', schedule='3s', refresh_mode='DIFFERENTIAL') }}
SELECT region, SUM(amount) AS revenue, COUNT(*) AS orders
FROM {{ ref('silver_orders') }}
GROUP BY region
When dbt builds gold_revenue, it knows to build silver_orders first. pg_trickle's scheduler knows that gold_revenue depends on silver_orders and won't refresh it until the upstream is current.
Mixing Materializations
Not every model needs to be a stream table. You can mix materializations freely:
models:
- name: raw_events # materialized: table (standard dbt)
- name: cleaned_events # materialized: pgtrickle (stream table)
- name: event_summary # materialized: pgtrickle (stream table)
- name: monthly_report # materialized: table (batch, runs nightly)
The stream tables update continuously. The batch tables update when you run dbt run. pg_trickle only manages the stream tables — the rest are standard dbt.
Freshness Checks
dbt's source freshness feature checks whether source tables have been updated recently. For stream tables, you can also check whether the stream table itself is fresh:
sources:
- name: stream_tables
freshness:
warn_after: {count: 10, period: second}
error_after: {count: 30, period: second}
loaded_at_field: data_timestamp
tables:
- name: revenue_by_region
dbt source freshness will now alert if revenue_by_region is more than 30 seconds stale — meaning pg_trickle's scheduler has fallen behind.
When to Use Which
| Scenario | Materialization |
|---|---|
| Data changes continuously, needs to be fresh | pgtrickle |
| Data is loaded in bulk nightly, batch is fine | table |
| Lightweight derivation, no storage needed | view |
| Read-your-writes required in the application | pgtrickle with IMMEDIATE mode |
| Historical snapshots for audit trail | snapshot (standard dbt) |
The rule of thumb: if you're running dbt run more than once an hour because users complain about stale data, switch that model to pgtrickle materialization and let pg_trickle handle the freshness.
← Back to Blog Index | Documentation
Deploying RAG at Scale: pg_trickle as Your Embedding Infrastructure
Post-refresh hooks, drift-aware reindexing, and the operational reality of production vector search
You've built a RAG application. The embeddings are in PostgreSQL. pgvector is handling the approximate nearest-neighbor search. pg_trickle is keeping your stream tables fresh. It works in development. It works with 10,000 documents.
Now you need it to work with 50 million documents, 200 tenants, a 99.9% uptime SLA, and an on-call engineer who has never heard of IVFFlat.
This post is about the operational side of running pgvector at scale with pg_trickle — the things that don't come up in tutorials but determine whether your system survives its first Black Friday.
The Three Problems Nobody Warns You About
Once you've solved embedding freshness (stream tables keep vectors current), three new problems emerge:
- Index drift. Your HNSW or IVFFlat index gets less accurate over time as data changes, but nothing tells you when to rebuild it.
- Operational blindness. You have no visibility into embedding lag, index health, or which stream tables are falling behind.
- Multi-tenant isolation. Shared ANN indexes are fast but leak performance characteristics across tenants. Separate indexes per tenant are correct but expensive to manage.
These aren't pgvector problems. They're operational problems that appear at scale with any vector infrastructure. The question is whether your tooling helps you manage them or whether you're building monitoring dashboards from scratch.
Problem 1: Index Drift Is Silent and Cumulative
Here's what happens to an HNSW index over time.
When you first create the index, it builds a navigable small-world graph over your current data. Every vector is a node. Edges connect nearby nodes across multiple graph layers. Query traversal starts from a fixed entry point and greedily hops toward the query vector. The graph structure guarantees logarithmic traversal time and high recall.
Then your data changes. Stream tables are refreshed. New embeddings are inserted. Old embeddings are updated or deleted.
HNSW handles inserts well — new nodes are connected into the existing graph. But it handles deletions by marking nodes as tombstones. The edges to and from deleted nodes remain. Graph traversal still visits these dead nodes, backtracks, and tries other paths.
After 50,000 deletions and 50,000 insertions (a net-zero change in table size), your index has 50,000 tombstones — nodes that add latency and reduce recall without contributing results.
IVFFlat has a different problem. Its cluster centroids were computed from the data at build time. As the data distribution shifts — new topics, new product categories, seasonal changes — the centroids become stale. Vectors are assigned to the nearest existing centroid, even if that centroid no longer represents the region well. Recall degrades gradually.
Both degradation modes are silent. Your queries still return results. The results just get slowly worse. Users experience it as "the search feels off" before anyone measures it.
The Old Approach: Scheduled Rebuilds
Most teams handle this with a cron job:
# Every Sunday at 3am
0 3 * * 0 psql -c "REINDEX INDEX CONCURRENTLY idx_docs_embedding;"
This is a blunt instrument. You rebuild every week whether you need to or not. You don't rebuild mid-week even if a massive data import just invalidated half your index. The rebuild is expensive — for a 50-million-row table with 1536-dimensional vectors, REINDEX CONCURRENTLY takes 30–60 minutes and doubles your storage temporarily.
The pg_trickle Approach: Drift-Aware Reindexing
pg_trickle v0.38.0 introduces post_refresh_action — a per-stream-table option that runs after each refresh cycle:
SELECT pgtrickle.alter_stream_table(
'docs_embedded',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.10 -- 10% of rows changed since last reindex
);
After every refresh, the scheduler checks:
$$\text{drift} = \frac{\text{rows_changed_since_last_reindex}}{\text{total_rows}}$$
If drift exceeds the threshold (10% in this example), the scheduler enqueues a REINDEX CONCURRENTLY job in a lower-priority tier. The reindex runs asynchronously — it doesn't block the next refresh cycle or delay search queries.
The key details:
- rows_changed_since_last_reindex is tracked in pg_trickle's catalog. Every refresh increments it by the number of rows merged (inserts + updates + deletes). A REINDEX resets it to zero.
- REINDEX CONCURRENTLY builds a new index alongside the old one, then swaps atomically. No downtime. Readers continue using the old index until the new one is ready.
- The lower-priority tier means the reindex doesn't compete with your refresh cycles. If your system is under load, the reindex waits. Your data freshness is never sacrificed for index maintenance.
- The threshold is tunable. For a search-critical application, set it to 0.05 (5%). For a background analytics corpus, 0.20 (20%) is fine.
There are also two simpler options:
-- Always ANALYZE after refresh (updates planner statistics)
post_refresh_action => 'analyze'
-- Always REINDEX after refresh (aggressive, for small tables)
post_refresh_action => 'reindex'
For most production systems, reindex_if_drift is the right choice. It rebuilds when necessary and skips when the index is still healthy.
Problem 2: You Can't Fix What You Can't See
A stream table maintaining an embedding corpus has several dimensions you need to monitor:
- Embedding lag. How far behind is the stream table from the source data? If your 10-second schedule is consistently taking 12 seconds, you're falling behind.
- Index age. When was the HNSW index last rebuilt? How much has the data changed since then?
- Drift percentage. What fraction of the current index reflects stale centroid assignments or tombstone accumulation?
- Refresh cost. How long does each differential refresh take? Is it trending up?
Without monitoring, you discover problems when users complain. With monitoring, you discover them in a Grafana dashboard at 9am.
pgtrickle.vector_status()
v0.38.0 adds a dedicated monitoring view for vector stream tables:
SELECT * FROM pgtrickle.vector_status();
stream_table | embedding_col | index_type | total_rows | rows_changed | drift_pct | last_refresh | refresh_lag_ms | last_reindex | index_age_hours | post_action
-----------------+---------------+------------+------------+--------------+-----------+---------------------+----------------+---------------------+-----------------+------------------
docs_embedded | embedding | hnsw | 2,340,891 | 128,445 | 5.49 | 2026-04-27 14:30:02 | 3,241 | 2026-04-25 03:00:00 | 59.50 | reindex_if_drift
user_taste | taste_vec | hnsw | 892,103 | 14,209 | 1.59 | 2026-04-27 14:30:05 | 1,892 | 2026-04-27 08:15:00 | 6.25 | reindex_if_drift
product_search | search_vec | ivfflat | 12,089,442 | 1,450,281 | 12.00 | 2026-04-27 14:29:58 | 8,102 | 2026-04-20 03:00:00 | 179.50 | reindex_if_drift
At a glance: product_search has 12% drift and hasn't been reindexed in a week. If the threshold is 10%, the next refresh will trigger a rebuild.
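For day-to-day triage, a quick filter over the same view works well — a sketch that flags corpora whose drift already exceeds a 10% threshold:
-- Vector stream tables whose indexes are due for a rebuild
SELECT stream_table, index_type, drift_pct, last_reindex
FROM pgtrickle.vector_status()
WHERE drift_pct > 10
ORDER BY drift_pct DESC;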
The view integrates with pg_trickle's self-monitoring system (v0.20+). If you're already scraping pg_trickle metrics with Prometheus, the vector-specific metrics are included automatically.
Prometheus Metrics
The same data is exposed as Prometheus metrics:
# HELP pgtrickle_vector_drift_ratio Fraction of rows changed since last reindex
# TYPE pgtrickle_vector_drift_ratio gauge
pgtrickle_vector_drift_ratio{stream_table="docs_embedded"} 0.0549
pgtrickle_vector_drift_ratio{stream_table="user_taste"} 0.0159
pgtrickle_vector_drift_ratio{stream_table="product_search"} 0.1200
# HELP pgtrickle_vector_index_age_seconds Seconds since last REINDEX
# TYPE pgtrickle_vector_index_age_seconds gauge
pgtrickle_vector_index_age_seconds{stream_table="docs_embedded"} 214200
pgtrickle_vector_index_age_seconds{stream_table="user_taste"} 22500
pgtrickle_vector_index_age_seconds{stream_table="product_search"} 646200
# HELP pgtrickle_vector_refresh_lag_ms Milliseconds since last successful refresh
# TYPE pgtrickle_vector_refresh_lag_ms gauge
pgtrickle_vector_refresh_lag_ms{stream_table="docs_embedded"} 3241
Standard Grafana alerting rules:
- alert: VectorIndexDriftHigh
expr: pgtrickle_vector_drift_ratio > 0.15
for: 10m
labels:
severity: warning
annotations:
summary: "Vector index drift > 15% on {{ $labels.stream_table }}"
description: "Consider lowering reindex_drift_threshold or investigating write volume."
- alert: VectorRefreshLagHigh
expr: pgtrickle_vector_refresh_lag_ms > 30000
for: 5m
labels:
severity: critical
annotations:
summary: "Embedding corpus {{ $labels.stream_table }} is >30s behind source data"
This is the same observability pattern pg_trickle uses for all stream tables, extended with vector-specific dimensions. The on-call engineer doesn't need to know about IVFFlat centroids. They need to know "this metric is above threshold" and have a runbook that says "pg_trickle handles it automatically, but if drift stays high after reindex, investigate write volume."
Problem 3: Multi-Tenant Vector Search
Multi-tenant RAG is where most vector search architectures break down.
The naive approach: one shared table, one shared HNSW index, filter by tenant_id at query time. This works until it doesn't:
- Cross-tenant recall interference. A large tenant's embeddings dominate the HNSW graph. Small tenants' queries traverse through the large tenant's nodes to reach their own data. Recall varies by tenant size.
- Over-fetching. HNSW returns the k approximate nearest neighbors globally, then filters to the tenant. If a tenant has 0.1% of the data, you need to retrieve ~1000× more candidates to find 10 results. Latency is unpredictable.
- Data isolation. A WHERE clause is only as reliable as the developer who writes it. One missing filter and you've leaked data across tenants.
Tiered Tenancy
The right architecture depends on tenant size. pg_trickle supports three patterns:
Large tenants (>1M embeddings): Dedicated stream table per tenant.
-- For tenant "acme" with millions of documents
SELECT pgtrickle.create_stream_table(
name => 'search_corpus_acme',
query => $$
SELECT d.id, d.body, d.embedding, d.metadata
FROM documents d
WHERE d.tenant_id = 'acme'
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON search_corpus_acme USING hnsw (embedding vector_cosine_ops);
Each large tenant gets a dedicated HNSW index with no cross-tenant interference. The index is smaller, so queries are faster. Drift-aware reindexing operates per-tenant, rebuilding only the indexes that need it.
Medium tenants (10K–1M embeddings): Partitioned stream table with partial indexes.
-- Shared stream table with tenant partitioning
SELECT pgtrickle.create_stream_table(
name => 'search_corpus_medium',
query => $$
SELECT d.id, d.body, d.embedding, d.tenant_id, d.metadata
FROM documents d
WHERE d.tenant_id IN (SELECT tenant_id FROM tenants WHERE tier = 'medium')
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- Per-tenant partial indexes
CREATE INDEX ON search_corpus_medium USING hnsw (embedding vector_cosine_ops)
WHERE tenant_id = 'tenant_a';
CREATE INDEX ON search_corpus_medium USING hnsw (embedding vector_cosine_ops)
WHERE tenant_id = 'tenant_b';
-- ... generated dynamically per tenant
Partial HNSW indexes scope the graph to one tenant's data. The planner picks the right partial index when the query includes WHERE tenant_id = ?. No cross-tenant interference, no over-fetching.
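A query shaped like this — with the tenant filter matching the partial index predicate — lets the planner use the per-tenant HNSW graph (the $1 placeholder stands for the query embedding passed by the application):
SELECT id, body
FROM search_corpus_medium
WHERE tenant_id = 'tenant_a'   -- must match the partial index predicate
ORDER BY embedding <=> $1      -- $1: query embedding, bound as a vector parameter
LIMIT 10;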
Small tenants (<10K embeddings): Shared table with RLS.
-- For hundreds of small tenants sharing one table
SELECT pgtrickle.create_stream_table(
name => 'search_corpus_shared',
query => $$
SELECT d.id, d.body, d.embedding, d.tenant_id
FROM documents d
WHERE d.tenant_id IN (SELECT tenant_id FROM tenants WHERE tier = 'small')
$$,
schedule => '15 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- RLS enforces isolation at the database level
ALTER TABLE search_corpus_shared ENABLE ROW LEVEL SECURITY;
ALTER TABLE search_corpus_shared FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON search_corpus_shared
AS RESTRICTIVE FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::text);
CREATE INDEX ON search_corpus_shared USING hnsw (embedding vector_cosine_ops);
For small tenants, over-fetching is a real trade-off: a tenant with 500 rows in a 50,000-row shared index is 1% of the data, so HNSW may need to visit many non-matching nodes before returning 10 results. In practice this is acceptable because the total shared index is small (small tenants have little data by definition), so even with over-fetching, absolute query latency stays low. RLS guarantees data isolation regardless of application bugs. The shared index means less infrastructure per tenant.
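The application sets the tenant for the session (or transaction) before querying — a minimal sketch with an illustrative tenant id:
-- Set the tenant; the RESTRICTIVE policy filters every query on this connection
SET app.current_tenant_id = 'tenant_542';
SELECT id, body
FROM search_corpus_shared
ORDER BY embedding <=> $1   -- $1: query embedding
LIMIT 10;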
Monitoring Per Tenant
pgtrickle.vector_status() reports per stream table. With the tiered pattern above, large tenants have dedicated entries. Medium tenants share a stream table but have per-partition drift tracking. Small tenants share everything — but their data volume is small enough that drift accumulates slowly.
The Docker Image: One Command to RAG
For teams evaluating pg_trickle + pgvector or building a proof of concept, v0.38.0 ships a Docker image with everything pre-installed:
docker run -d \
-e POSTGRES_PASSWORD=secret \
-p 5432:5432 \
ghcr.io/trickle-labs/pg-trickle-rag:latest
The image includes:
- PostgreSQL 18
- pgvector 0.8+
- pg_trickle (latest release)
- pgai (if available) for in-database embedding generation
-- Everything is ready
CREATE EXTENSION vector;
CREATE EXTENSION pg_trickle;
-- Start building
SELECT pgtrickle.create_stream_table(...);
No Docker Compose files to manage. No extension compilation. No shared_preload_libraries configuration. One container, one port, ready.
The embedding_stream_table() API (v0.40.0)
For v0.37–v0.38, creating a vector stream table requires writing the full denormalization query by hand. This is powerful but verbose. For the common case — "I have a documents table, I want a searchable, indexed, denormalized corpus" — there's a lot of boilerplate.
v0.40.0 introduces a higher-level API:
SELECT pgtrickle.embedding_stream_table(
name => 'docs_search',
source_table => 'documents',
embedding_columns => ARRAY['embedding'],
denormalize_from => ARRAY[
ROW('doc_tags', 'JOIN doc_tags ON doc_tags.doc_id = documents.id', 'tags'),
ROW('doc_metadata', 'LEFT JOIN doc_metadata ON doc_metadata.doc_id = documents.id', 'metadata')
],
schedule => '10 seconds',
index_type => 'hnsw',
post_refresh_action => 'reindex_if_drift'
);
The function:
- Auto-generates the denormalization query from the source table and join specifications.
- Creates the stream table with the generated query.
- Creates the HNSW (or IVFFlat) index on the embedding column(s).
- Configures drift-aware reindexing with sensible defaults.
- Returns a summary of what was created.
For expert users who need to inspect or customize the generated SQL, a dry-run mode returns the query without executing it:
SELECT pgtrickle.embedding_stream_table(
name => 'docs_search',
source_table => 'documents',
embedding_columns => ARRAY['embedding'],
dry_run => true
);
-- Returns: the generated CREATE STREAM TABLE SQL, the index DDL, and the configuration
This is syntactic sugar over existing primitives. Everything it does, you can do with create_stream_table() and CREATE INDEX. The value is removing boilerplate for the 80% use case.
Sparse and Half-Precision Vectors (v0.39.0)
Production RAG systems often use tiered storage for embeddings. Full-precision vector(1536) for the canonical representation. Half-precision halfvec(1536) for indexed search (half the storage, nearly identical recall). Sparse vectors (sparsevec) for SPLADE or learned sparse models used in re-ranking.
v0.39.0 extends the algebraic aggregate support to these types:
-- Half-precision centroid for storage-efficient search
SELECT pgtrickle.create_stream_table(
name => 'category_centroids_half',
query => $$
SELECT category_id,
halfvec_avg(embedding::halfvec(1536)) AS centroid
FROM products
GROUP BY category_id
$$,
refresh_mode => 'DIFFERENTIAL'
);
-- Sparse vector aggregate for SPLADE re-ranking
SELECT pgtrickle.create_stream_table(
name => 'topic_sparse_centroids',
query => $$
SELECT topic_id,
sparsevec_avg(sparse_embedding) AS centroid
FROM documents
GROUP BY topic_id
$$,
refresh_mode => 'DIFFERENTIAL'
);
halfvec_avg maintains running state upcast to full-precision vector(d) internally (to avoid rounding accumulation), then casts back to halfvec(d) on read. sparsevec_avg computes element-wise means over the union of sparse dimensions, treating absent entries as zero.
This matters for storage-tiered architectures. You can maintain a pipeline:
raw documents → vector(1536) embeddings
→ halfvec(1536) search corpus (indexed, 50% storage)
→ sparsevec(1536) re-ranking corpus (indexed, ~10% storage)
Each layer is a stream table. Each is incrementally maintained. Each has its own index and drift monitoring.
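As a sketch of the middle layer (names are illustrative; the cast assumes 1536-dimension embeddings as above, and halfvec_cosine_ops is pgvector's half-precision cosine operator class):
-- Half-precision search layer derived from the full-precision documents table
SELECT pgtrickle.create_stream_table(
    name => 'search_corpus_half',
    query => $$
        SELECT d.id,
               d.body,
               d.embedding::halfvec(1536) AS embedding_h
        FROM documents d
    $$,
    schedule => '10 seconds',
    refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON search_corpus_half USING hnsw (embedding_h halfvec_cosine_ops);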
Reactive Distance Subscriptions (v0.39.0)
Traditional monitoring is pull-based: you query a dashboard. Reactive subscriptions are push-based: the database tells you when something happens.
v0.39.0 extends reactive subscriptions to vector-distance predicates:
LISTEN fraud_alert;
SELECT pgtrickle.create_reactive_subscription(
'fraud_alert',
$$
SELECT t.id, t.amount, t.merchant
FROM transactions_embedded t
JOIN known_fraud_patterns k
ON t.embedding <=> k.embedding < 0.05
$$
);
This fires a NOTIFY fraud_alert whenever a new transaction's embedding comes within cosine distance 0.05 of a known fraud pattern. The subscription is differential — it only fires for newly matched rows, not on every refresh cycle.
Other applications:
- Content moderation: Alert when a new post is semantically similar to previously flagged content.
- Competitive intelligence: Notify when a new product listing appears close to your product's embedding.
- SLA monitoring: Alert when the average query embedding drifts far from the training distribution (distribution shift detection).
LISTEN content_review;
SELECT pgtrickle.create_reactive_subscription(
'content_review',
$$
SELECT p.id, p.author_id, p.body
FROM posts_embedded p
JOIN flagged_content_centroids fc
ON p.embedding <=> fc.centroid < 0.08
WHERE p.created_at > now() - interval '1 hour'
$$
);
The subscription is maintained by the DVM engine like any stream table, but instead of materializing results into a table, it emits NOTIFY events for new matches. The application listens on the PostgreSQL connection and receives events in real time.
The Operational Playbook
Here's a concrete operational guide for running pgvector + pg_trickle in production.
Initial Setup
-- 1. Install extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trickle;
-- 2. Create your embedding corpus as a stream table
SELECT pgtrickle.create_stream_table(
name => 'search_corpus',
query => $$ ... your denormalization query ... $$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- 3. Create the ANN index
CREATE INDEX ON search_corpus USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- 4. Enable drift-aware reindexing
SELECT pgtrickle.alter_stream_table(
'search_corpus',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.10
);
Tuning Guidelines
| Parameter | Conservative | Balanced | Aggressive |
|---|---|---|---|
| schedule | 30 seconds | 10 seconds | 3 seconds |
| reindex_drift_threshold | 0.05 (5%) | 0.10 (10%) | 0.20 (20%) |
| post_refresh_action | reindex_if_drift | reindex_if_drift | analyze |
| HNSW m | 32 | 16 | 8 |
| HNSW ef_construction | 400 | 200 | 100 |
- Conservative: High recall priority. More frequent reindexing, larger graph degree. For search-critical applications.
- Balanced: Good recall with reasonable maintenance overhead. The default for most systems.
- Aggressive: Maximize throughput, tolerate moderate recall fluctuation. For analytics or non-user-facing search.
Monitoring Checklist
- Embedding lag (pgtrickle_vector_refresh_lag_ms): Should be below 2× your schedule interval. If it is consistently above, your refresh is too expensive for the cycle time.
- Drift ratio (pgtrickle_vector_drift_ratio): Steady state should stay below your threshold. If it keeps hitting the threshold, your write rate is high enough to warrant a shorter schedule interval (more frequent refreshes) to reduce the delta size per cycle.
- Reindex frequency: Check last_reindex in vector_status(). If you are reindexing every day, consider whether the drift threshold is too low or the write volume is genuinely high.
- Refresh duration trend: If refresh time is trending up, check whether the delta size is growing (more changes per cycle) or the stream table is getting larger (more rows to merge into).
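A quick way to spot-check the first three items in one query — note that apart from last_reindex, the column names below are assumptions about what vector_status() exposes, so adjust them to the actual view:
-- Hypothetical spot check against the monitoring view described above.
SELECT stream_table,
       refresh_lag_ms,   -- assumed column: current embedding lag
       drift_ratio,      -- assumed column: changes since last reindex / index size
       last_reindex
FROM pgtrickle.vector_status()
WHERE stream_table = 'search_corpus';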
Failure Modes and Recovery
Refresh consistently late. The differential refresh takes longer than the schedule interval. Solution: increase the schedule interval, reduce query complexity, or check for missing indexes on source tables.
Drift never decreases after reindex. The write rate is so high that by the time the reindex finishes, enough new changes have accumulated to exceed the threshold again. Solution: increase the threshold, or accept that for very high-write-rate tables, the index will always have some drift.
REINDEX CONCURRENTLY fails. PostgreSQL can fail concurrent reindex under certain conditions (e.g., deadlocks with concurrent sessions, or if a conflicting DDL operation runs during the build). pg_trickle retries once; on second failure, it logs a warning and skips until the next threshold crossing. The old index continues to serve queries — the failure is not catastrophic.
What's Built, What's Coming
To be honest about the timeline:
| Feature | Status | Version |
|---|---|---|
| Vector columns in stream tables | Working today | — |
| FULL refresh with pgvector expressions | Working today | — |
| Denormalized corpus pattern (multi-JOIN) | Working today | — |
| vector_avg / vector_sum aggregates | Shipping | v0.37.0 |
| post_refresh_action (analyze, reindex, reindex_if_drift) | Planned | v0.38.0 |
| pgtrickle.vector_status() monitoring view | Planned | v0.38.0 |
| RAG Docker image (pg_trickle + pgvector) | Planned | v0.38.0 |
| halfvec_avg / sparsevec_avg aggregates | Planned | v0.39.0 |
| Reactive distance subscriptions | Planned | v0.39.0 |
| embedding_stream_table() API | Planned | v0.40.0 |
| Per-tenant ANN patterns (docs + examples) | Planned | v0.40.0 |
If you're running pgvector today and want to start using pg_trickle, the denormalized-corpus and FULL-refresh patterns work right now. vector_avg arrives in the next release. The operational tooling (monitoring, drift-aware reindex) follows immediately after.
The Bigger Picture
The AI infrastructure ecosystem is fragmented. Embeddings are generated by one service, stored in another, indexed by a third, and served by a fourth. Each boundary is a consistency gap. Each service is a failure domain. Each integration is a maintenance burden.
pg_trickle's position is that most of this fragmentation is unnecessary — at least for the PostgreSQL half of the stack. If your transactional data lives in PostgreSQL (and it probably does), there's no fundamental reason your derived embedding data should live somewhere else.
pgvector stores and indexes vectors in PostgreSQL. pg_trickle keeps derived data synchronized with source data in PostgreSQL. Together, they handle the full pipeline from "source data changed" to "search index is current" — inside one database, one process, one ACID transaction boundary.
The remaining external dependency is the embedding model itself. You still need to call an API (or run a local model) to generate embeddings from text. That's a real boundary — neural network inference is expensive and doesn't belong in a database transaction. But everything after the embedding is written — maintaining corpora, computing aggregates, rebuilding indexes, monitoring freshness — that's data infrastructure. And PostgreSQL is very good at data infrastructure.
pg_trickle is an open-source PostgreSQL extension. Source code, documentation, and installation instructions are at github.com/trickle-labs/pg-trickle. The pgvector integration roadmap spans v0.37.0 through v0.40.0.
How pg_trickle Handles Diamond Dependencies
The refresh ordering problem that nobody talks about until it causes double-counting
If you've used pg_trickle for more than a few stream tables, you've probably built a DAG. Stream table C depends on A and B. A and B each depend on the same source table. You refresh, and everything works.
But there's a subtle correctness problem hiding in that topology. It's called a diamond dependency, and if your IVM engine doesn't handle it, your aggregates will be wrong.
The Diamond
Here's the shape:
source_table
/ \
▼ ▼
st_A (agg1) st_B (agg2)
\ /
▼ ▼
st_C (combines A and B)
st_A and st_B are both stream tables that depend on source_table. st_C is a stream table that JOINs or UNIONs the output of st_A and st_B.
The problem: when source_table changes, the delta needs to propagate through both A and B before C can safely refresh. If C refreshes after A but before B, it sees a partially-updated state. Depending on the query, this can cause double-counting, missing rows, or incorrect aggregates.
A Concrete Example
Consider a sales analytics platform:
-- Source table
CREATE TABLE sales (
id bigserial PRIMARY KEY,
region text NOT NULL,
product_id bigint NOT NULL,
amount numeric(12,2) NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
-- Branch A: revenue by region
SELECT pgtrickle.create_stream_table(
'revenue_by_region',
$$SELECT region, SUM(amount) AS revenue, COUNT(*) AS sale_count
FROM sales GROUP BY region$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
-- Branch B: revenue by product
SELECT pgtrickle.create_stream_table(
'revenue_by_product',
$$SELECT product_id, SUM(amount) AS revenue, COUNT(*) AS sale_count
FROM sales GROUP BY product_id$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
-- Diamond tip: executive summary combining both
SELECT pgtrickle.create_stream_table(
'exec_summary',
$$SELECT
'total' AS label,
(SELECT SUM(revenue) FROM revenue_by_region) AS regional_total,
(SELECT SUM(revenue) FROM revenue_by_product) AS product_total
$$,
schedule => '3s', refresh_mode => 'DIFFERENTIAL'
);
In a correct state, regional_total and product_total should always be equal — they're both SUM(amount) over the same source data, just grouped differently.
Without diamond-aware scheduling, here's what can happen:
1. A new sale for $1,000 is inserted into sales.
2. The scheduler refreshes revenue_by_region — it picks up the new sale.
3. The scheduler refreshes exec_summary — it sees the updated revenue_by_region but the old revenue_by_product. regional_total = $101,000, product_total = $100,000. They disagree.
4. The scheduler refreshes revenue_by_product — now it catches up.
5. The next refresh of exec_summary fixes the inconsistency.
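You can watch that window directly with a probe against the diamond tip — under diamond-unaware scheduling it intermittently returns false:
-- Returns false during the window in step 3 above, true once both branches
-- and exec_summary have caught up.
SELECT regional_total,
       product_total,
       regional_total = product_total AS consistent
FROM exec_summary;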
For a few seconds, your executive dashboard showed inconsistent numbers. In a financial system, this is a bug. In an audit, this is a finding.
How pg_trickle Solves This
pg_trickle's DAG resolver identifies diamond dependencies at stream table creation time. When you create exec_summary, the engine discovers the diamond:
sales → revenue_by_region → exec_summary
sales → revenue_by_product → exec_summary
Both paths originate from sales and converge at exec_summary. pg_trickle records this as a diamond group.
You can see it:
SELECT * FROM pgtrickle.diamond_groups();
This returns the set of stream tables that share a common ancestor and converge at a common descendant.
The Scheduling Rule
The scheduler enforces a simple invariant: all members of a diamond group must be refreshed to the same frontier before the convergence point is refreshed.
In practice, this means:
1. sales changes.
2. The scheduler refreshes revenue_by_region from the change buffer.
3. The scheduler refreshes revenue_by_product from the same change buffer epoch.
4. Only after both have been refreshed to the same frontier does exec_summary become eligible for refresh.
Steps 2 and 3 can happen in any order, or even in parallel (if you have multiple worker slots). But step 4 is blocked until both are complete.
This is the frontier tracker at work. Each stream table has a frontier — a version vector that tracks which changes it has incorporated. The scheduler won't refresh a downstream table unless all its upstream dependencies have frontiers that are at least as advanced as the change epoch being processed.
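Conceptually, the eligibility test is a predicate like the sketch below. The catalog tables and columns here (diamond_members, stream_frontiers, frontier_epoch) are hypothetical stand-ins for illustration — not pg_trickle's actual internal schema:
-- exec_summary is eligible only if no member of its diamond group has a
-- frontier older than the change epoch being processed (42 here).
SELECT NOT EXISTS (
    SELECT 1
    FROM diamond_members m                 -- hypothetical: members of one diamond group
    JOIN stream_frontiers f                -- hypothetical: frontier per stream table
      ON f.stream_table = m.stream_table
    WHERE m.convergence_table = 'exec_summary'
      AND f.frontier_epoch < 42            -- a lagging upstream blocks the refresh
) AS exec_summary_eligible;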
What This Costs
Diamond-aware scheduling adds a small amount of latency to the convergence point. Instead of refreshing exec_summary as soon as any upstream changes, it waits for all upstreams in the diamond group to catch up.
In practice, this wait is bounded by the slowest branch of the diamond — typically a few hundred milliseconds. For the correctness guarantee you get in return, this is a good trade.
If your topology doesn't have diamonds, there's zero overhead. The scheduler only applies the diamond constraint when diamond_groups() identifies a non-empty set of convergence points.
Deeper Diamonds
Diamonds can be nested. Consider:
source_1 ──→ st_A ──→ st_C ──→ st_E
source_1 ──→ st_B ──→ st_C
source_2 ──→ st_B ──→ st_D ──→ st_E
source_2 ──────────────→ st_D
There are two diamonds here: one at st_C (from source_1 via A and B) and one at st_E (from source_2 via B/C and D). pg_trickle handles arbitrarily nested diamonds — the frontier tracker works at every level of the DAG.
Detecting Diamonds in Existing Deployments
If you're adding a new stream table to an existing DAG and want to check whether it creates a diamond:
-- Before creating the new stream table, check the current DAG
SELECT * FROM pgtrickle.dependency_tree('your_new_stream_table');
-- After creating it, check for diamond groups
SELECT * FROM pgtrickle.diamond_groups();
pg_trickle also logs a notice when a diamond is detected at creation time:
NOTICE: stream table "exec_summary" creates a diamond dependency via
"sales" → ["revenue_by_region", "revenue_by_product"].
Refresh scheduling will ensure frontier consistency.
Why Other Systems Get This Wrong
Most materialized view systems don't handle diamonds at all, because they don't have a concept of incremental refresh ordering. REFRESH MATERIALIZED VIEW is a full recomputation — there's no delta to propagate incorrectly, so there's no diamond problem. (There's also no performance, but that's a different post.)
External IVM systems that do handle incremental deltas often punt on diamonds by documenting it as a known limitation: "avoid creating cyclic or diamond dependency topologies." pg_trickle treats it as a core correctness requirement.
Differential Dataflow for the Rest of Us
The mathematics behind incremental view maintenance, explained without a PhD
pg_trickle maintains query results incrementally. When one row changes, it updates the result without recomputing everything. That sounds simple, but it requires some careful mathematics to get right.
This post explains how that mathematics works — in plain language, without assuming a background in database theory or systems research. The goal is to build enough intuition that you can reason about when incremental maintenance is possible, when it isn't, and why.
If you just want to use pg_trickle, you don't need this. If you want to understand why it works, read on.
The Problem With "Just Recompute"
Start with a concrete example. You're tracking e-commerce orders and maintaining a revenue_by_region table:
-- The query
SELECT region, SUM(total) AS revenue
FROM orders
JOIN customers ON customers.id = orders.customer_id
GROUP BY region;
This query scans every row in orders, joins with customers, and computes sums per group. If orders has 100 million rows, this query reads 100 million rows every time you run it. If you run it every 5 seconds to stay fresh, you're reading 100 million rows every 5 seconds. At any realistic row size, that's gigabytes per second of I/O.
The observation that makes incremental maintenance possible is: most of the time, very little changes.
If 10 new orders come in over 5 seconds, only those 10 orders affect the result. The other 99,999,990 orders haven't changed. Recomputing the full aggregate is wasteful by a factor of 10 million.
Incremental view maintenance answers the question: given that orders changed by some set of rows, what's the corresponding change to revenue_by_region?
Collections and Multisets
Before getting to the math, a quick terminology note.
In differential dataflow, data is represented as multisets — collections where each element has a weight. A weight of +1 means the element is present. A weight of -1 means it was removed. A weight of +2 means it appears twice.
For a SQL table with distinct rows, every present row has weight +1.
When you insert a row, you add it to the multiset with weight +1.
When you delete a row, you add the same row with weight -1. (The net effect: the row is no longer present.)
When you update a row, you add the old value with weight -1 and the new value with weight +1. (The net effect: the old value is removed, the new value is added.)
This weight-based representation turns updates into insertions and deletions. It simplifies the mathematics considerably — you only need rules for how insertions and deletions propagate.
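To make that concrete, here's a sketch of how a single UPDATE could be represented in a weighted change buffer. The table layout is illustrative only — it is not pg_trickle's actual buffer schema:
-- Illustrative change-buffer layout: one weighted row per logical change.
CREATE TEMP TABLE delta_orders (
    id          bigint,
    customer_id bigint,
    total       numeric,
    weight      int       -- +1 = row asserted, -1 = row retracted
);
-- UPDATE orders SET total = 120 WHERE id = 42 (old total was 100) becomes:
INSERT INTO delta_orders VALUES
    (42, 7, 100, -1),     -- old value retracted
    (42, 7, 120, +1);     -- new value asserted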
The Δ Notation
In differential dataflow, we use Δ (delta) to mean "the change to."
If T is a table (multiset), then ΔT is the set of rows added and removed from T since some reference point. Each row in ΔT has a weight: +1 for insertions, -1 for deletions.
For a query Q(T) that produces result R, we want to compute ΔR (the change to the result) from ΔT (the change to the input), without touching the unchanged rows in T.
The question is: for each type of query operation, what's the delta rule?
Delta Rules for SQL Operations
Filter (WHERE)
-- Query
SELECT * FROM orders WHERE total > 100;
-- Delta rule
Δresult = { row ∈ ΔT | row.total > 100 }
If a row is inserted and it satisfies the filter, it appears in the result. If it doesn't, it's ignored. This is the simplest rule — delta of a filter is just filtering the delta.
Projection (SELECT columns)
-- Query
SELECT customer_id, total FROM orders;
-- Delta rule
Δresult = { (row.customer_id, row.total) : row ∈ ΔT }
Project each delta row onto the selected columns. Also simple.
Join
Joins are where it gets interesting.
-- Query
SELECT o.*, c.region
FROM orders o
JOIN customers c ON c.id = o.customer_id;
If we change orders (Δorders) or customers (Δcustomers), how does the result change?
The delta rule for a join has three parts:
Part 1 (orders change):
Δresult += Δorders ⋈ customers
For each changed order, join it with the current (unchanged) customer data. The result is the set of new/deleted joined rows contributed by the order changes.
Part 2 (customers change):
Δresult += orders ⋈ Δcustomers
For each changed customer, join it with the current order data. The result is the set of rows that need to be updated because the customer's region changed.
Part 3 (both change simultaneously):
Δresult += Δorders ⋈ Δcustomers
If both change in the same batch, this cross-term captures rows affected by both changes simultaneously. For most workloads, this term is small (it requires a customer and one of their orders to change in the same batch).
In pg_trickle, "current data" in parts 1 and 2 is accessed from the live table, usually via an index lookup on the join key. This is why joins across tables with good indexes are handled efficiently: each delta lookup is a point query, not a scan.
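Written against the illustrative delta_orders buffer from earlier (id, customer_id, total, weight), the first term of the join delta rule is roughly:
-- Δorders ⋈ customers: each changed order row is joined against the live
-- customers table (an index lookup on customers.id), carrying its weight.
SELECT d.id,
       d.customer_id,
       d.total,
       c.region,
       d.weight
FROM delta_orders d
JOIN customers c ON c.id = d.customer_id;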
Aggregation (GROUP BY + aggregate functions)
This is the most important delta rule for most analytical use cases.
-- Query
SELECT region, SUM(total) AS revenue, COUNT(*) AS order_count
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY region;
When an order is inserted for a customer in europe:
region = 'europe'delta_revenue = +order.totaldelta_order_count = +1
The delta rule for SUM:
new_sum(g) = old_sum(g) + delta_sum(g)
= old_sum(g) + SUM(weight × value for changed rows in group g)
For COUNT:
new_count(g) = old_count(g) + delta_count(g)
= old_count(g) + SUM(weight for changed rows in group g)
For AVG:
-- AVG is maintained as (running_sum, running_count)
new_avg(g) = new_sum(g) / new_count(g)
The key property: these aggregates are linear. Their delta is a function of the delta inputs, not of the full input. The mathematical name for this is being a monoid homomorphism — the aggregate is a homomorphism from the monoid of multisets to the monoid of aggregate values.
SUM is linear: SUM(A ∪ B) = SUM(A) + SUM(B).
COUNT is linear: COUNT(A ∪ B) = COUNT(A) + COUNT(B).
AVG is not directly linear but can be decomposed into linear components (sum and count).
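Continuing the same sketch, the per-group delta for SUM and COUNT is a single aggregation over the weighted delta rows — this is the only aggregation the refresh has to run:
-- One row per affected region: how far the running SUM and COUNT move.
SELECT c.region,
       SUM(d.weight * d.total) AS delta_revenue,
       SUM(d.weight)           AS delta_order_count
FROM delta_orders d
JOIN customers c ON c.id = d.customer_id
GROUP BY c.region;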
The Non-Differentiable Cases
Not every aggregate is linear.
MEDIAN (PERCENTILE_CONT(0.5)): You can't compute the median of a union from the medians of the parts. MEDIAN([1,2,3]) = 2 and MEDIAN([0,2,100]) = 2, yet adding the single element 4 to each gives unions with medians 2.5 and 3 — the part medians alone don't determine the result, so no function f satisfies MEDIAN(A ∪ B) = f(MEDIAN(A), MEDIAN(B)). Computing the median incrementally requires knowing the sorted order of all elements, which requires O(n) state.
RANK() OVER (ORDER BY x): Inserting one row can change the rank of every other row. The delta rule is Δrank(row) = COUNT(new rows that rank higher than row). Computing this requires knowing where in the ordering the new row falls, which requires a sorted data structure. The update is O(log n) per changed row, but it's not O(1) per change batch.
DISTINCT counting (COUNT DISTINCT): Removing an element from a DISTINCT count requires knowing whether the element appears elsewhere. This is the "set membership" problem — you need to maintain the full set, not just a count. Approximate cardinality (HyperLogLog) is differentiable; exact DISTINCT counting is not.
MAX and MIN with deletions: Adding a new maximum is easy (new_max = max(old_max, new_value)). Removing the current maximum is hard — you need to find the second largest value. This requires a sorted structure.
pg_trickle is transparent about these limitations. If you create a stream table with a non-differentiable query, it either rejects the DIFFERENTIAL mode or falls back to FULL with a warning. The extension's query analyzer classifies each aggregate and operator before deciding the refresh mode.
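For example, a stream table definition like the following (an illustrative query) contains a non-differentiable aggregate, so per the behavior just described it will either be rejected for DIFFERENTIAL or created with a FULL fallback and a warning:
SELECT pgtrickle.create_stream_table(
    name => 'order_value_medians',
    query => $$
        SELECT c.region,
               PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY o.total) AS median_total
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        GROUP BY c.region
    $$,
    refresh_mode => 'DIFFERENTIAL'   -- not possible for PERCENTILE_CONT
);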
The Operator Pipeline
In differential dataflow, a query is represented as a pipeline of operators, each with its own delta rule:
ΔT (raw change from CDC)
│
▼ Filter operator (apply WHERE clauses)
│
▼ Join operator (propagate through JOINs)
│
▼ Project operator (select columns)
│
▼ Aggregate operator (GROUP BY + aggregate functions)
│
▼ ΔR (change to apply to the stream table)
Each operator transforms the incoming delta using its delta rule. The output of each operator is itself a delta — a set of weighted rows to insert or remove.
The final delta ΔR is applied to the stream table using standard SQL INSERT, UPDATE, DELETE — or more efficiently, a single MERGE statement that handles all three cases.
Why This Is Faster Than It Sounds
The pipeline is only computed for the rows in ΔT — the rows that actually changed. Every other row in the source tables is untouched.
For the join operator, "current customer data" is fetched via index lookups on the join key. If 10 orders changed, that's 10 index lookups. Not a table scan.
For the aggregate operator, only the affected groups are recomputed. If 10 orders changed, and they're all from europe, only the europe row in revenue_by_region is updated.
The cost of a refresh cycle scales with:
- The number of rows changed in the source tables during the cycle
- The number of distinct groups affected in each aggregate
- The number of join hops required to propagate the change to the result
For typical OLTP workloads with hundreds of changes per cycle and stable aggregate groups, refresh cycles complete in 10–50ms regardless of total table size.
The Consistency Guarantee
One subtle point: the delta computation must be consistent.
When computing Δorders ⋈ customers (part 1 of the join delta rule), "current customers" should be the customers table after any changes in the current batch have been applied.
pg_trickle handles this by processing deltas in topological order when multiple source tables change simultaneously. If both orders and customers change in the same batch, the join delta is computed with both changes applied, not with a mix of old and new values.
This is the correctness property that makes IVM hard in practice. pg_trickle's engine handles it by processing each refresh cycle as a single transaction that sees a consistent snapshot of all source changes.
The MERGE Application
After computing ΔR, pg_trickle applies it to the stream table using a single MERGE statement:
MERGE INTO revenue_by_region AS target
USING delta_revenue AS source
ON target.region = source.region
WHEN MATCHED AND source.revenue != 0 THEN
UPDATE SET revenue = target.revenue + source.revenue,
order_count = target.order_count + source.order_count
WHEN MATCHED AND source.revenue = 0 AND source.order_count = 0 THEN
DELETE
WHEN NOT MATCHED AND source.revenue != 0 THEN
INSERT (region, revenue, order_count)
VALUES (source.region, source.revenue, source.order_count);
A group with a positive delta gets updated. A group that drops to zero (all orders removed) gets deleted. A new group gets inserted.
The MERGE is atomic — it either fully applies or not at all. This is what makes pg_trickle's consistency guarantee possible: the stream table is always in a state consistent with some version of the source tables.
What This Means for You
The practical implications of the differential dataflow mathematics:
Query design: Stick to filters, joins, projections, and linear aggregates (SUM, COUNT, AVG). These are fully differentiable. Use DIFFERENTIAL mode.
Non-linear aggregates: Use PERCENTILE, RANK, or COUNT DISTINCT? Use FULL mode for those tables. The stream table infrastructure is still useful — scheduling, monitoring, the catalog — but the refresh will scan the full source.
Index design: The join delta rules (Δorders ⋈ customers) require efficient index lookups on join keys. Ensure your source tables have indexes on the columns used in JOIN ON conditions. Missing join indexes make pg_trickle fall back to sequential scans during delta computation.
Change volume: The cost of a refresh cycle scales with the number of changed rows. High-frequency small changes (1–100 rows/cycle) are very cheap. High-frequency bulk operations (10k rows/cycle from batch imports) are more expensive — the delta is large. Use pg_trickle's change_buffer_size GUC to tune how much change is batched before a forced refresh.
The mathematics is not magic — it's a formal system with well-defined properties and limitations. Understanding those properties lets you build stream tables that are fast, correct, and predictable.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
DISTINCT That Doesn't Recount
Reference counting for incremental deduplication
SELECT DISTINCT in a materialized view means a full table scan on every refresh. PostgreSQL has to see all the rows to determine which are unique. There's no shortcut — without knowing the full data set, you can't know if a row is duplicated.
pg_trickle has a shortcut: reference counting. Each distinct value gets a counter tracking how many source rows produce it. Insert a duplicate? Increment the counter. Delete a row? Decrement the counter. When the counter hits zero, the value is removed from the result.
No scan of existing data. O(delta) per refresh.
The Problem
SELECT DISTINCT region, product_category
FROM orders
JOIN customers ON customers.id = orders.customer_id;
This query deduplicates (region, product_category) pairs. With 50 million orders across 200 unique pairs, a full refresh scans 50 million rows to produce 200 rows.
With IVM, 10 new orders come in. Do they create any new (region, product_category) pairs? Or are all 10 in existing pairs? To answer this without scanning, you need to know how many rows currently produce each pair.
The __pgt_dup_count Column
pg_trickle maintains a hidden column __pgt_dup_count on stream tables that use DISTINCT. This column tracks the multiplicity of each row in the result — how many source rows produce it.
-- What the stream table actually stores (internal representation)
SELECT region, product_category, __pgt_dup_count
FROM pgtrickle.distinct_pairs;
region | product_category | __pgt_dup_count
------------+------------------+-----------------
Northeast | Electronics | 12,847
Northeast | Clothing | 8,432
Southeast | Electronics | 15,291
... | ... | ...
The user-visible query (SELECT * FROM distinct_pairs) hides __pgt_dup_count — you just see the distinct values.
Delta Rules
INSERT a row that matches an existing distinct value:
__pgt_dup_count += 1
No new row is added to the result. The value was already there.
INSERT a row with a new distinct value:
INSERT row with __pgt_dup_count = 1
New distinct value appears in the result.
DELETE a row:
__pgt_dup_count -= 1
If __pgt_dup_count = 0: DELETE the row from the result
The distinct value disappears only when all source rows producing it are gone.
UPDATE a row (changes the distinct columns):
Decrement old value's __pgt_dup_count (possibly remove it)
Increment new value's __pgt_dup_count (possibly insert it)
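As a sketch of how these rules translate into a single statement, assume a staging table delta_pairs(region, product_category, weight_sum) holding the net weight per distinct value for the current cycle — the staging table is an assumption for illustration, not pg_trickle's internal name:
-- Apply the reference-counting rules in one MERGE (PostgreSQL 15+).
MERGE INTO distinct_pairs AS t
USING delta_pairs AS d
   ON t.region = d.region
  AND t.product_category = d.product_category
WHEN MATCHED AND t.__pgt_dup_count + d.weight_sum > 0 THEN
    UPDATE SET __pgt_dup_count = t.__pgt_dup_count + d.weight_sum
WHEN MATCHED THEN
    DELETE                                   -- count dropped to zero
WHEN NOT MATCHED AND d.weight_sum > 0 THEN
    INSERT (region, product_category, __pgt_dup_count)
    VALUES (d.region, d.product_category, d.weight_sum);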
Example
SELECT pgtrickle.create_stream_table(
name => 'active_regions',
query => $$
SELECT DISTINCT c.region
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'active'
$$,
schedule => '5s'
);
Initial state: 5 regions, each with thousands of active orders.
A customer in the "Pacific" region places their first order. Before this, "Pacific" had 0 active orders — it wasn't in the result. Now it has 1.
pg_trickle:
- Sees the INSERT in the change buffer.
- Joins with customers to get region = "Pacific".
- Checks: does "Pacific" exist in the result? No → INSERT with __pgt_dup_count = 1.
Later, the only Pacific order is cancelled (status changes to 'cancelled', which is filtered out by WHERE status = 'active'). The effective delta is a DELETE of that row.
pg_trickle:
- Decrements Pacific's __pgt_dup_count from 1 to 0.
- Removes "Pacific" from the result.
All other regions are untouched. The refresh processes 1 row, not millions.
DISTINCT ON
PostgreSQL's DISTINCT ON is a different feature from DISTINCT. It returns one row per group, ordered by a specified column:
SELECT DISTINCT ON (customer_id)
customer_id, order_id, total, created_at
FROM orders
ORDER BY customer_id, created_at DESC;
This returns the most recent order per customer. It's a common pattern for "latest row per group."
pg_trickle handles DISTINCT ON with the same reference-counting approach, but the tie-breaking logic is more complex. The stream table maintains the winning row (based on the ORDER BY) and updates it when:
- A new row with a higher sort value is inserted (it becomes the new winner).
- The current winner is deleted (the next-best row becomes the winner).
This requires knowing what the "next-best" row is, which in turn requires a lookup against the source data. The cost is O(changed groups) — for each group that was affected by the delta, one query to find the new winner.
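The per-group lookup has roughly this shape — a sketch of the question pg_trickle needs to answer, not its literal internal SQL:
-- Re-derive the winner for one customer whose current winning row was deleted.
SELECT customer_id, order_id, total, created_at
FROM orders
WHERE customer_id = 42          -- the group touched by the delta
ORDER BY created_at DESC
LIMIT 1;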
DISTINCT with Expressions
DISTINCT can appear with computed columns:
SELECT DISTINCT date_trunc('month', created_at) AS month
FROM events;
The reference counting applies to the computed value, not the raw column. Two events on different days in the same month produce the same distinct value and share a __pgt_dup_count.
Performance
The reference-count approach makes DISTINCT maintenance O(|ΔT|) — proportional to the number of changed rows, not the table size.
| Scenario | FULL refresh | DIFFERENTIAL refresh |
|---|---|---|
| 10 inserts, all in existing groups | ~500ms (full scan) | <1ms (10 counter increments) |
| 10 inserts, 2 new groups | ~500ms | <1ms (8 increments + 2 inserts) |
| 10 deletes, none empties a group | ~500ms | <1ms (10 counter decrements) |
| 10 deletes, 1 group drops to zero | ~500ms | <1ms (9 decrements + 1 delete) |
The only scenario where DIFFERENTIAL doesn't help is when the delta touches every group — but that's rare for DISTINCT queries, which by definition have fewer groups than rows.
When Not to Use DISTINCT in Stream Tables
DISTINCT adds the __pgt_dup_count overhead to every row. If your query naturally produces unique rows (e.g., GROUP BY with a key that guarantees uniqueness), adding DISTINCT is redundant and wasteful.
Check with EXPLAIN:
EXPLAIN SELECT DISTINCT region, SUM(total)
FROM orders GROUP BY region;
If the GROUP BY already produces unique (region) rows, the DISTINCT is a no-op. Remove it — pg_trickle still maintains the stream table correctly, without the reference-counting overhead.
Summary
DISTINCT in stream tables uses reference counting (__pgt_dup_count) to avoid full-scan deduplication. Insert increments, delete decrements, and rows are removed only when the count reaches zero.
The cost is O(delta), not O(table). For the common case — many source rows, few distinct values, small changes per cycle — this is orders of magnitude faster than recomputing.
DISTINCT ON works similarly but with tie-breaking logic. Remove DISTINCT if GROUP BY already ensures uniqueness. And don't worry about the hidden counter column — it's invisible to your queries.
Distributed IVM with Citus
Incremental view maintenance across sharded PostgreSQL
Citus distributes PostgreSQL tables across multiple worker nodes. You get horizontal write scaling, parallel query execution, and the ability to store more data than fits on a single machine.
What you lose is the ability to maintain derived data easily. CREATE MATERIALIZED VIEW doesn't work across distributed tables in any incremental way. REFRESH MATERIALIZED VIEW on a Citus coordinator scans every shard, pulls the data to the coordinator, computes the result, and writes it back. At scale, this is slow and resource-intensive.
pg_trickle's Citus integration (v0.32–v0.34) solves this. It maintains stream tables across a Citus cluster with shard-aware CDC, distributed delta routing, and automatic recovery after shard rebalances.
How It Works
CDC on Workers
In a standard (non-Citus) setup, pg_trickle attaches triggers to source tables on a single PostgreSQL instance. In a Citus setup, the source tables are distributed — each worker node holds a subset of the shards.
pg_trickle installs CDC triggers on every worker that hosts a shard of the source table. Each worker captures changes into a local change buffer. The coordinator's scheduler polls these change buffers via dblink connections to each worker.
Worker 1 (shards 1-4): Worker 2 (shards 5-8):
orders_1 → changes_1 orders_5 → changes_5
orders_2 → changes_2 orders_6 → changes_6
orders_3 → changes_3 orders_7 → changes_7
orders_4 → changes_4 orders_8 → changes_8
↓ ↓
└──────── coordinator ───────┘
↓
delta computation
↓
stream table (on coordinator or distributed)
Delta Computation
The coordinator merges change buffers from all workers, computes the delta using the standard DVM engine, and applies it to the stream table. The stream table itself can be:
- Reference table: Replicated to all nodes. Good for small lookup tables and aggregates.
- Distributed table: Sharded across workers. Good for large stream tables that need to be co-located with queries.
-- Create a distributed source table
SELECT create_distributed_table('orders', 'customer_id');
SELECT create_distributed_table('customers', 'id');
-- Create a stream table over distributed sources
SELECT pgtrickle.create_stream_table(
'revenue_by_region',
$$SELECT c.region, SUM(o.amount) AS revenue, COUNT(*) AS order_count
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region$$,
schedule => '3s',
refresh_mode => 'DIFFERENTIAL'
);
pg_trickle detects that the source tables are Citus-distributed and automatically sets up the per-worker CDC infrastructure.
Shard-Aware Delta Routing
The key optimization: not all shards contribute to every delta.
If an order is inserted on Worker 1, only the change buffers on Worker 1 contain new data. pg_trickle's scheduler knows this — it checks each worker's change buffer depth before polling. Workers with no changes are skipped entirely.
For a 16-worker cluster where only 3 workers have new data since the last refresh, the coordinator only polls 3 workers instead of 16. This reduces network overhead and coordinator CPU time proportionally.
Co-Located Joins
Citus is fastest when JOINs are co-located — when the joined tables are distributed on the same column, so the JOIN can execute locally on each worker without shuffling data.
pg_trickle respects this. If orders is distributed on customer_id and customers on id, so that matching rows live on the same worker, the delta computation pushes down to the workers:
Worker 1:
local_orders_change + local_customers → local_delta
Worker 2:
local_orders_change + local_customers → local_delta
...
Coordinator:
merge(local_delta_1, local_delta_2, ...) → stream table MERGE
The expensive part — the JOIN — happens on the workers in parallel. The coordinator only needs to merge the per-worker deltas, which is typically a small aggregation.
Handling Shard Rebalances
When you add a node to a Citus cluster or rebalance shards, data moves between workers. This invalidates the change buffers on the old and new worker for the affected shards.
pg_trickle detects shard rebalances by monitoring pg_dist_shard_placement. When a shard moves:
- The scheduler pauses refresh for affected stream tables.
- CDC triggers are installed on the new shard placement.
- Change buffers for the moved shard are reset.
- A targeted full refresh is run for the affected groups.
- Normal differential refresh resumes.
This happens automatically. From the application's perspective, the stream table might be slightly stale during the rebalance (the refresh is paused), but it never returns incorrect data.
Multi-Tenant Analytics
Citus's most common pattern is multi-tenant: each tenant's data is on the same shard, co-located by tenant_id. Stream tables work naturally with this pattern:
-- Distributed by tenant_id
SELECT create_distributed_table('events', 'tenant_id');
-- Per-tenant aggregation
SELECT pgtrickle.create_stream_table(
'tenant_event_summary',
$$SELECT
tenant_id,
event_type,
COUNT(*) AS event_count,
MAX(created_at) AS last_event
FROM events
WHERE created_at >= now() - interval '7 days'
GROUP BY tenant_id, event_type$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
-- Distribute the stream table on the same column
SELECT create_distributed_table('tenant_event_summary', 'tenant_id');
Each tenant's events are on a single worker. The CDC and delta computation for that tenant happen entirely on that worker. The coordinator only handles the final MERGE. This scales linearly with the number of workers.
Cross-Shard Aggregation
Not all queries are shard-local. Global aggregates (total revenue across all tenants, system-wide event counts) require data from every worker:
-- Global aggregate across all tenants
SELECT pgtrickle.create_stream_table(
'global_event_counts',
$$SELECT event_type, COUNT(*) AS total_events
FROM events
GROUP BY event_type$$,
schedule => '5s', refresh_mode => 'DIFFERENTIAL'
);
For cross-shard aggregates, each worker computes a partial delta locally, and the coordinator combines them. This is more expensive than shard-local queries (it requires data transfer from every worker), but the differential approach means only the changed groups are transferred — not the entire dataset.
Limitations
- Maximum workers: pg_trickle has been tested with up to 32 Citus workers. Beyond that, the coordinator's dblink polling becomes the bottleneck. For larger clusters, consider increasing the polling interval.
- Non-co-located JOINs: If the JOIN requires shuffling data between workers (non-co-located distribution columns), the delta computation falls back to the coordinator. This is slower but still correct.
- IMMEDIATE mode: Not supported on Citus-distributed stream tables. The trigger would need to perform cross-shard operations within the source transaction, which Citus doesn't support transactionally. Use DIFFERENTIAL mode.
- Schema changes on distributed tables: ALTER STREAM TABLE ... QUERY works, but the full refresh during schema migration reads from all workers sequentially. Plan for a longer migration window on large clusters.
When to Use Citus vs. Single-Node
If your data fits on a single PostgreSQL instance (up to a few TB with good hardware), you don't need Citus. pg_trickle on a single node is simpler, faster, and has no cross-node coordination overhead.
Use Citus + pg_trickle when:
- Your source data exceeds single-node capacity
- You need horizontal write scaling (more workers = more write throughput)
- Your tenants have strict data isolation requirements (each tenant on a dedicated shard)
- Your analytics queries benefit from parallel execution across workers
Drain Mode: Zero-Downtime Upgrades for Stream Tables
Graceful quiesce before maintenance, rolling restarts, and extension upgrades
You need to upgrade pg_trickle. Or restart PostgreSQL for a configuration change. Or run a maintenance operation that requires no active refreshes.
If you just restart PostgreSQL while a refresh is in progress, the refresh is interrupted. The stream table is left in a partially-updated state. pg_trickle recovers on the next startup — it detects the interrupted refresh and either retries or marks the table for repair — but it's not clean.
Drain mode provides a clean shutdown path. pgtrickle.drain() tells the scheduler to stop dispatching new refreshes and wait for in-flight refreshes to complete. When all refreshes are done, pgtrickle.is_drained() returns true, and you can safely restart.
The API
-- Signal drain
SELECT pgtrickle.drain();
-- Check status
SELECT pgtrickle.is_drained();
-- false (still waiting for in-flight refreshes)
-- Wait a moment...
SELECT pgtrickle.is_drained();
-- true (all refreshes complete, scheduler idle)
After drain completes:
- The scheduler is running but not dispatching new work.
- All in-flight refreshes have completed.
- Change buffers continue accumulating (CDC triggers still fire).
- Stream tables are still queryable.
The Upgrade Workflow
# Step 1: Drain
psql -c "SELECT pgtrickle.drain();"
# Step 2: Wait for drain
while ! psql -qtAc "SELECT pgtrickle.is_drained();" | grep -q 't'; do
sleep 2
done
# Step 3: Upgrade
psql -c "ALTER EXTENSION pg_trickle UPDATE;"
# Step 4: Resume
psql -c "SET pg_trickle.enabled = on;"
Between steps 2 and 4, no refreshes are running. The ALTER EXTENSION UPDATE can safely migrate schema, catalog tables, and internal state without racing against active refresh operations.
What Happens During Drain
When drain() is called:
- No new refreshes are dispatched. The scheduler's dispatch loop skips all tables.
- In-flight refreshes continue. Any refresh that's already executing (running a delta query, applying a MERGE) completes normally.
- IMMEDIATE mode refreshes still fire. Since they're synchronous within user transactions, they can't be deferred. Drain only affects background-scheduled refreshes.
- CDC continues. Triggers keep writing to change buffers. WAL decoder keeps running. Changes accumulate.
The drain is "soft" — it doesn't kill any processes or abort any transactions. It just stops starting new work.
Drain Duration
How long does drain take? It depends on the longest currently-running refresh.
In practice:
- Most refreshes complete in under 1 second.
- A complex FULL refresh on a large table might take 10–30 seconds.
- If a refresh is stuck (waiting on a lock, for example), drain waits indefinitely.
You can set a timeout:
-- Drain with 60-second timeout
SELECT pgtrickle.drain(timeout_seconds => 60);
If in-flight refreshes don't complete within 60 seconds, drain() returns false and the in-flight refreshes continue. You can then decide: wait longer, or proceed with the restart (accepting the interrupted-refresh recovery cost).
CloudNativePG and Rolling Restarts
In Kubernetes deployments with CloudNativePG, rolling restarts are the norm. The operator restarts pods one at a time, waiting for each to be ready before restarting the next.
Drain mode integrates with this:
- A
preStophook callspgtrickle.drain(). - The readiness probe checks
pgtrickle.is_drained(). - Once drained, the pod is marked unready, and the operator proceeds with the restart.
# CloudNativePG Cluster manifest (excerpt)
spec:
postgresql:
preStop:
exec:
command:
- psql
- -c
- "SELECT pgtrickle.drain();"
This ensures zero interrupted refreshes during rolling restarts. Change buffers accumulate during the restart window and are processed on the next scheduler cycle after the pod comes back.
Drain and HA Failover
During a PostgreSQL failover (primary → standby promotion), drain mode isn't typically used — failovers are unplanned. But for planned failovers (maintenance, OS patching):
- Drain the current primary:
SELECT pgtrickle.drain(); - Wait for drain:
SELECT pgtrickle.is_drained(); - Promote the standby.
- pg_trickle's launcher on the new primary detects promotion via
pg_is_in_recovery()and starts the scheduler.
The change buffers on the old primary are replicated to the standby via WAL. No changes are lost.
Monitoring Drain State
SELECT * FROM pgtrickle.health_summary();
During drain, health_summary() includes:
scheduler_state | drain_requested | inflight_refreshes | drain_elapsed_seconds
-----------------+-----------------+--------------------+------------------------
draining | t | 2 | 4.7
When inflight_refreshes reaches 0 and scheduler_state changes to drained, it's safe to proceed.
Resuming After Drain
Drain is not permanent. To resume normal operation without restarting:
-- Cancel the drain
SET pg_trickle.enabled = on;
The scheduler resumes dispatching. The accumulated change buffers are processed in the next cycle. Depending on how long the drain lasted, the first post-drain refresh may be larger than usual (more accumulated changes).
Summary
Drain mode is the safe shutdown path for pg_trickle. drain() stops new refreshes. In-flight refreshes complete. is_drained() confirms it's safe to proceed.
Use it before:
ALTER EXTENSION pg_trickle UPDATE- PostgreSQL restart
- Planned failover
- CloudNativePG rolling restart
The alternative — restarting mid-refresh — works (pg_trickle recovers), but it's not clean. Drain mode is one function call for a clean cutover.
Error Budgets for Stream Tables
SRE-style freshness monitoring with p50/p99 latency, staleness tracking, and budget consumption
Your team has an SLA: "the dashboard must reflect data no older than 30 seconds." You've set the stream table schedule to 5 seconds. That should leave plenty of headroom. But does it?
What if the stream table occasionally takes 20 seconds to refresh because of a complex join? What if it fails 3 times in a row and the scheduler suspends it? What if the change buffer grows large enough to cause a FULL fallback?
You won't know unless you measure. pg_trickle's sla_summary() function provides SRE-style metrics: percentile latencies, freshness lag, error counts, and an error budget that tells you how much headroom you have before your SLA is violated.
The SLA Summary API
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics');
stream_table | p50_refresh_ms | p99_refresh_ms | freshness_lag_s | error_count | error_budget_pct | window
-------------------+----------------+----------------+-----------------+-------------+------------------+---------
dashboard_metrics | 4.8 | 23.1 | 2.3 | 1 | 94.2 | 1h
What Each Metric Means
p50_refresh_ms: The median refresh duration over the measurement window. If this is close to your schedule interval, you're refreshing slower than you're scheduling.
p99_refresh_ms: The 99th percentile refresh duration. This is your worst-case performance (excluding true outliers). If your schedule is 5 seconds and p99 is 23 seconds, 1% of your refreshes are taking nearly 5× longer than expected.
freshness_lag_s: The current staleness — time since the last successful refresh completed. If this exceeds your SLA, the data is stale right now.
error_count: Number of failed refreshes in the measurement window. Each failure means one missed refresh cycle.
error_budget_pct: The remaining error budget as a percentage. This is the key metric:
$$\text{error\_budget\_pct} = 100 \times \left(1 - \frac{\text{actual\_staleness\_violations}}{\text{allowed\_staleness\_violations}}\right)$$
At 100%, you've had zero violations. At 0%, you've exhausted your budget. Below 0%, you're in breach.
Defining SLAs
pg_trickle doesn't enforce SLAs — it measures compliance. The SLA is defined by your team's freshness requirement.
The sla_summary() function accepts an optional window parameter:
-- Last hour (default)
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics');
-- Last 24 hours
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics', window => '24h');
-- Last 7 days
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics', window => '7d');
For a 30-second freshness SLA measured over 1 hour:
- There are 120 expected refresh opportunities (at 30-second intervals).
- If 6 of those opportunities were missed (refresh took too long or failed), that's a 5% violation rate → 95% error budget remaining.
Alerting on Error Budget
Combine sla_summary() with reactive subscriptions or pg_cron to alert when the budget is low:
-- pg_cron: check error budget every 5 minutes
SELECT cron.schedule('sla-check', '*/5 * * * *', $job$
DO $body$
DECLARE
  budget NUMERIC;
BEGIN
  SELECT error_budget_pct INTO budget
  FROM pgtrickle.sla_summary('dashboard_metrics');
  IF budget < 50 THEN
    PERFORM pg_notify('sla_warning',
      format('dashboard_metrics error budget at %s%%', budget));
  END IF;
  IF budget < 10 THEN
    PERFORM pg_notify('sla_critical',
      format('dashboard_metrics error budget at %s%%', budget));
  END IF;
END $body$;
$job$);
A listener picks up the NOTIFY and routes it to PagerDuty, Slack, or your alerting system.
Setting Meaningful SLAs
Different stream tables serve different purposes. A real-time fraud detection table and a weekly summary report shouldn't have the same SLA.
Recommended tiers:
| Tier | Freshness Target | Example |
|---|---|---|
| Critical | < 5 seconds | Fraud detection, inventory levels, live pricing |
| Standard | < 30 seconds | Dashboards, operational metrics, search indexes |
| Background | < 5 minutes | Daily reports, batch analytics, archival |
For each tier, set the stream table schedule to ~⅓ of the freshness target. This gives 3 refresh attempts before the SLA is breached:
| Tier | Freshness Target | Schedule |
|---|---|---|
| Critical | 5s | 1–2s |
| Standard | 30s | 10s |
| Background | 5m | 1–2m |
Diagnosing Budget Consumption
When the error budget is dropping, diagnose the cause:
1. Check refresh history
SELECT
refresh_id,
refresh_mode,
duration_ms,
rows_changed,
status,
error_message
FROM pgtrickle.get_refresh_history('dashboard_metrics')
WHERE status != 'success' OR duration_ms > 10000
ORDER BY refresh_id DESC
LIMIT 20;
Look for:
- status = 'error' — failed refreshes. Check error_message.
- duration_ms spikes — slow refreshes that may exceed the SLA window.
- refresh_mode = 'FULL' — fallbacks from DIFFERENTIAL to FULL, which are usually slower.
2. Check for mode fallbacks
SELECT
COUNT(*) FILTER (WHERE refresh_mode = 'DIFFERENTIAL') AS differential_count,
COUNT(*) FILTER (WHERE refresh_mode = 'FULL') AS full_count,
AVG(duration_ms) FILTER (WHERE refresh_mode = 'DIFFERENTIAL') AS avg_diff_ms,
AVG(duration_ms) FILTER (WHERE refresh_mode = 'FULL') AS avg_full_ms
FROM pgtrickle.get_refresh_history('dashboard_metrics')
WHERE created_at > NOW() - INTERVAL '1 hour';
If FULL refreshes are frequent and slow, investigate why DIFFERENTIAL is falling back (change ratio too high, spilling, etc.).
3. Check change buffer sizes
SELECT * FROM pgtrickle.pgt_cdc_status
WHERE pgt_name = 'dashboard_metrics';
Large change buffers → larger deltas → slower DIFFERENTIAL refreshes → potential SLA violations.
Error Budget Across All Stream Tables
For a system-wide view:
SELECT
s.name,
s.schedule,
sla.p50_refresh_ms,
sla.p99_refresh_ms,
sla.freshness_lag_s,
sla.error_budget_pct,
CASE
WHEN sla.error_budget_pct >= 95 THEN 'healthy'
WHEN sla.error_budget_pct >= 50 THEN 'warning'
ELSE 'critical'
END AS status
FROM pgtrickle.pgt_stream_tables s
CROSS JOIN LATERAL pgtrickle.sla_summary(s.name) sla
ORDER BY sla.error_budget_pct ASC;
This ranks all stream tables by error budget health. Tables at the top of the list need attention.
Prometheus Integration
pg_trickle exports SLA metrics as Prometheus gauges:
pg_trickle_refresh_p50_ms{stream_table="dashboard_metrics"} 4.8
pg_trickle_refresh_p99_ms{stream_table="dashboard_metrics"} 23.1
pg_trickle_freshness_lag_seconds{stream_table="dashboard_metrics"} 2.3
pg_trickle_error_budget_pct{stream_table="dashboard_metrics"} 94.2
Use these in Grafana dashboards and alerting rules:
# Prometheus alert rule
- alert: StreamTableSLABreach
expr: pg_trickle_error_budget_pct < 20
for: 5m
labels:
severity: critical
annotations:
summary: "Stream table {{ $labels.stream_table }} error budget below 20%"
Summary
sla_summary() gives you SRE-style visibility into stream table health: percentile latencies, freshness lag, error counts, and error budget.
Set SLAs per tier (critical/standard/background). Monitor the error budget. Alert when it drops below a threshold. Diagnose with refresh history, mode fallback analysis, and change buffer checks.
The goal: know you're meeting your freshness SLA before a customer asks why the dashboard is stale. The error budget tells you exactly how much headroom you have left.
Event Sourcing Read Models Without Replay
Project live read-optimized views from an append-only event store — no replay required
Event sourcing is architecturally elegant and operationally painful. The elegance is in the append-only event log: every state change is recorded as an immutable fact, the full history is preserved, and you can derive any view of the data by replaying events from the beginning. The pain is in that last part — "replaying events from the beginning."
When your event store has 500 million events and you need to add a new read model (say, "revenue by product category for the last 90 days"), you face a choice. You can replay 500 million events through your new projection, which takes hours and requires careful orchestration. Or you can implement the projection going forward and accept that it only has data from its creation date onward. Neither option is great.
pg_trickle eliminates this trade-off. Define your read model as a SQL query over the events table, register it as a stream table, and the initial materialization happens once (a full query against the events table). After that, every new event is processed incrementally — the read model stays current within milliseconds of the event being committed, without replaying anything.
The Event Store Pattern
A typical PostgreSQL event store looks like this:
CREATE TABLE events (
event_id bigserial PRIMARY KEY,
stream_id uuid NOT NULL, -- aggregate ID
event_type text NOT NULL,
payload jsonb NOT NULL,
metadata jsonb DEFAULT '{}',
created_at timestamptz DEFAULT now(),
version integer NOT NULL -- per-stream sequence number
);
CREATE INDEX ON events (stream_id, version);
CREATE INDEX ON events (event_type, created_at);
Events are immutable. You never update or delete them. New state is expressed by appending new events. An order goes through OrderPlaced → OrderConfirmed → OrderShipped → OrderDelivered, each as a separate row in the events table.
Read models (projections) materialize the current state by folding over these events. The "current order status" projection looks at the latest event per stream. The "revenue by region" projection aggregates OrderPlaced events. The "inventory levels" projection sums ItemAdded and ItemRemoved events per product.
Traditional Projection Approaches
In-memory projectors: A service subscribes to the event stream, maintains state in memory, and projects events as they arrive. Fast, but state is lost on restart (requires full replay), and you need one service per projection.
Catch-up subscription: A service reads events from a position marker, processes them, and advances the marker. Handles restarts but is sequential — adding a new projection means replaying from the beginning.
Periodic snapshot + replay: Take a snapshot of the projection state periodically, replay from the snapshot position on restart. Reduces replay cost but adds complexity around snapshot management.
All of these approaches share a problem: the projection logic lives in application code, separate from the database. It must handle ordering, idempotency, and failure recovery. It's another service to deploy, monitor, and scale.
Read Models as Stream Tables
With pg_trickle, a read model is just a SQL query materialized as a stream table:
-- Current order status (latest event per order)
SELECT pgtrickle.create_stream_table(
'order_status',
$$
SELECT DISTINCT ON (stream_id)
stream_id AS order_id,
payload->>'status' AS status,
payload->>'customer_id' AS customer_id,
payload->>'total' AS total,
created_at AS last_updated
FROM events
WHERE event_type IN ('OrderPlaced', 'OrderConfirmed', 'OrderShipped', 'OrderDelivered', 'OrderCancelled')
ORDER BY stream_id, version DESC
$$
);
When a new OrderShipped event is appended for order abc-123, the incremental refresh:
- Detects the new event row
- Evaluates the DISTINCT ON logic for stream abc-123
- Updates the single row in order_status for that order
It does not re-read the other 499,999,999 events. It does not replay the history of order abc-123. It processes one new event and produces one row update. The cost is constant regardless of how many historical events exist.
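Reading the projection is just reading a table; the freshly shipped order shows up with its new status after the next scheduled refresh, or immediately after calling pgtrickle.refresh_stream_table('order_status'):
-- The read model is an ordinary table kept current by the differential engine.
SELECT order_id, status, last_updated
FROM order_status
ORDER BY last_updated DESC
LIMIT 10;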
Revenue Analytics Without ETL
A common read model for e-commerce is revenue aggregation:
SELECT pgtrickle.create_stream_table(
'revenue_by_category',
$$
SELECT
payload->>'category' AS category,
date_trunc('day', created_at) AS day,
SUM((payload->>'amount')::numeric) AS revenue,
COUNT(*) AS order_count
FROM events
WHERE event_type = 'OrderPlaced'
GROUP BY payload->>'category', date_trunc('day', created_at)
$$
);
New OrderPlaced events are processed incrementally. The SUM and COUNT for the affected category and day are adjusted. No ETL pipeline, no separate analytics database, no nightly batch job. The read model is always within seconds of the event store.
For more sophisticated analytics — running totals, moving averages, cohort breakdowns — you can cascade stream tables:
-- Daily revenue per category (from above)
-- → Monthly summary with growth rate
SELECT pgtrickle.create_stream_table(
'monthly_category_performance',
$$
SELECT
category,
date_trunc('month', day) AS month,
SUM(revenue) AS monthly_revenue,
SUM(order_count) AS monthly_orders,
SUM(revenue) / NULLIF(SUM(order_count), 0) AS avg_order_value
FROM revenue_by_category
GROUP BY category, date_trunc('month', day)
$$
);
The cascade handles the dependency ordering automatically. New events flow through: event → daily aggregate → monthly aggregate. Each step is incremental.
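The "growth rate" mentioned above can be computed on read with a window function over the monthly stream table (or cascaded into yet another stream table if it's queried often); a sketch:
-- Month-over-month revenue growth per category, read from the cascaded table.
SELECT
  category,
  month,
  monthly_revenue,
  ROUND(
    100.0 * (monthly_revenue - LAG(monthly_revenue) OVER w)
          / NULLIF(LAG(monthly_revenue) OVER w, 0),
    1
  ) AS growth_pct
FROM monthly_category_performance
WINDOW w AS (PARTITION BY category ORDER BY month)
ORDER BY category, month;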
Inventory Projections
Inventory is the canonical event sourcing example: items are added and removed, and the current level is the sum of all additions minus all removals.
SELECT pgtrickle.create_stream_table(
'inventory_levels',
$$
SELECT
payload->>'product_id' AS product_id,
payload->>'warehouse_id' AS warehouse_id,
SUM(
CASE event_type
WHEN 'ItemReceived' THEN (payload->>'quantity')::integer
WHEN 'ItemShipped' THEN -(payload->>'quantity')::integer
WHEN 'ItemAdjusted' THEN (payload->>'adjustment')::integer
ELSE 0
END
) AS current_stock
FROM events
WHERE event_type IN ('ItemReceived', 'ItemShipped', 'ItemAdjusted')
GROUP BY payload->>'product_id', payload->>'warehouse_id'
$$
);
Every ItemReceived event increments the stock for that product-warehouse pair. Every ItemShipped event decrements it. The stream table maintains the running total without ever replaying the full history. If your warehouse processes 10,000 shipments per hour, the inventory levels update 10,000 times per hour — each update touching exactly one row in the projection.
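Consumers read the projection like any ordinary table. For example, a low-stock check (the threshold of 20 is arbitrary):
-- Read the incrementally maintained projection directly; no replay needed.
SELECT product_id, warehouse_id, current_stock
FROM inventory_levels
WHERE current_stock < 20
ORDER BY current_stock, product_id;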
Adding New Projections Without Replay
The traditional pain point of event sourcing is adding a new projection to an existing system. With a catch-up subscription, you need to replay from event zero. With pg_trickle, you define a new stream table and run the initial materialization:
-- New requirement: track customer lifetime value
SELECT pgtrickle.create_stream_table(
'customer_ltv',
$$
SELECT
payload->>'customer_id' AS customer_id,
COUNT(*) AS total_orders,
SUM((payload->>'amount')::numeric) AS lifetime_value,
MIN(created_at) AS first_order,
MAX(created_at) AS latest_order
FROM events
WHERE event_type = 'OrderPlaced'
GROUP BY payload->>'customer_id'
$$
);
The first refresh_stream_table call performs a full materialization — it reads all OrderPlaced events and computes the aggregates. This is a one-time cost, equivalent to the initial replay. After that, every new OrderPlaced event is processed incrementally. The read model is live from that moment forward.
The critical difference from a catch-up subscription: the "replay" is just a SQL query that PostgreSQL optimizes and executes in parallel. No single-threaded event processor. No position markers. No at-least-once vs. exactly-once semantics to worry about. The database handles it.
Consistency Guarantees
One of the subtle advantages of running projections inside the database is transactional consistency. When a new event is committed and the stream table is refreshed, the read model update happens in the same transactional context (or at a known, bounded lag if using background refresh).
This eliminates the classic event sourcing consistency problem: a user places an order, immediately queries their order list, and doesn't see it because the projection service hasn't processed it yet. With pg_trickle's IMMEDIATE refresh mode, the read model is updated before the transaction commits. The user always sees their own writes.
-- IMMEDIATE mode: projection updates in the same transaction as the event
SELECT pgtrickle.alter_stream_table('order_status', refresh_mode := 'IMMEDIATE');
For high-throughput projections where same-transaction consistency isn't needed, use DEFERRED mode and the background scheduler. The projection will be at most a few seconds behind — still far better than the minutes-to-hours lag typical of catch-up subscription systems.
When This Replaces Your Projection Service
You don't need a separate projection service when:
- Your read models can be expressed as SQL queries (aggregations, joins, filters, windowing)
- You want transactional consistency between writes and reads
- You want new projections to be deployable without replaying history
- You want projections to handle late events and corrections automatically
- You don't want to operate Kafka, RabbitMQ, or an event bus for internal projections
You still need a separate service when:
- Your projection logic involves external API calls or side effects
- You need to send emails or notifications as part of projection processing
- Your projection requires custom business logic that can't be expressed in SQL
- You're projecting across multiple databases or services
For most CRUD-heavy applications with analytics requirements, the SQL-based approach covers 80–90% of projection needs. The remaining projections (those with side effects) still benefit from the event store — they just read from it using a traditional subscriber.
Event sourcing gives you the history. pg_trickle gives you the read models — live, consistent, and without the replay tax.
← Back to Blog Index | Documentation
EXISTS and NOT EXISTS: The Delta Rules Nobody Talks About
Semi-joins and anti-joins, maintained incrementally
EXISTS and NOT EXISTS subqueries appear in almost every non-trivial SQL codebase. "Show me orders that have at least one item over $100." "Show me customers who haven't placed an order this month." They're the SQL equivalent of set membership and set complement.
Making them incremental is trickier than it looks. A semi-join (EXISTS) has binary output per row — the row either qualifies or it doesn't. A single insert into the subquery table can flip that binary for multiple outer rows. An anti-join (NOT EXISTS) is even worse: adding a matching row removes outer rows from the result.
pg_trickle handles both, using delta-key pre-filtering and inverted weight semantics. This post explains the rules.
EXISTS as a Semi-Join
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total > 1000
);
This returns customers who have at least one high-value order. From a set perspective, it's the semi-join: customers ⋉ (orders WHERE total > 1000).
The key property: duplicates in the subquery don't matter. Whether a customer has 1 or 100 high-value orders, they appear in the result exactly once.
Delta Rule for EXISTS
When orders change (insert/delete), the stream table delta is:
Insert into orders (new high-value order):
- Check: does the customer already have a qualifying order? (Was the EXISTS already true?)
- If no → the customer is newly qualifying. Add them to the result (+1 weight).
- If yes → no change. The EXISTS was already satisfied.
Delete from orders (removed high-value order):
- Check: does the customer still have any qualifying orders?
- If no → the customer no longer qualifies. Remove them from the result (-1 weight).
- If yes → no change. Other orders still satisfy the EXISTS.
This "before/after" check is the core of the semi-join delta. pg_trickle implements it using a reference count on the subquery match:
refcount(customer_id) = number of matching orders for that customer
- INSERT: increment refcount. If it went from 0 → 1, emit +1 to result.
- DELETE: decrement refcount. If it went from 1 → 0, emit -1 to result.
- If refcount was >1 before and is still >1 after, no result change.
NOT EXISTS as an Anti-Join
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.created_at > NOW() - INTERVAL '30 days'
);
Customers who haven't ordered in the last 30 days. This is the anti-join: customers ▷ (recent orders).
Delta Rule for NOT EXISTS
The delta is the inverse of EXISTS:
Insert into orders (new recent order):
- Check: did the customer previously have zero matching orders? (Was NOT EXISTS true?)
- If yes → the customer now has a matching order. Remove them from the result (-1 weight).
- If no → no change. The NOT EXISTS was already false.
Delete from orders (removed recent order):
- Check: does the customer now have zero matching orders?
- If yes → the customer re-qualifies. Add them to the result (+1 weight).
- If no → no change. Other recent orders remain.
Same reference counting, inverted logic. The cost is identical.
Delta-Key Pre-Filtering
The critical optimization is pre-filtering by the join key. When orders change, pg_trickle doesn't scan all customers to check the EXISTS condition. It filters by the join key from the change buffer.
If 5 new orders come in for customer IDs {12, 34, 56}, pg_trickle:
- Extracts the distinct customer_id values from the change buffer: {12, 34, 56}.
- Checks the reference count only for those 3 customers.
- Updates the result only if the reference count crossed the 0/1 boundary.
The cost is proportional to the number of distinct join keys in the change buffer, not the size of the customer table.
-- Internal logic (simplified)
WITH changed_keys AS (
SELECT DISTINCT customer_id
FROM pgtrickle_changes.changes_orders
WHERE __pgt_op IN ('I', 'D')
),
new_counts AS (
SELECT c.customer_id, COUNT(o.id) AS cnt
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
AND o.total > 1000
WHERE c.customer_id IN (SELECT customer_id FROM changed_keys)
GROUP BY c.customer_id
)
-- Emit delta based on cnt crossing 0 boundary
...
Correlated Subqueries
EXISTS subqueries are often correlated — they reference columns from the outer query:
SELECT p.product_id, p.name
FROM products p
WHERE EXISTS (
SELECT 1 FROM inventory i
WHERE i.product_id = p.product_id
AND i.warehouse_id = p.default_warehouse_id
AND i.quantity > 0
);
The correlation here is on two columns: product_id and default_warehouse_id. pg_trickle extracts both as the join key for pre-filtering.
When inventory changes for a specific (product_id, warehouse_id) pair, only that pair is checked against the reference count. Products in other warehouses are untouched.
SubLink Extraction
PostgreSQL's internal representation of EXISTS subqueries is called a "SubLink." pg_trickle's parser extracts SubLinks from the WHERE clause and converts them to semi-join or anti-join operators in the OpTree.
The extraction handles:
- Simple WHERE EXISTS (...) — direct semi-join.
- WHERE NOT EXISTS (...) — direct anti-join.
- WHERE EXISTS (...) AND other_condition — semi-join followed by filter.
- WHERE EXISTS (...) OR other_condition — requires special handling (covered below).
The OR Case
WHERE EXISTS (SELECT 1 FROM orders WHERE ...) OR status = 'VIP'
An OR with an EXISTS is harder because you can't simply convert to a semi-join — the row might qualify from the non-EXISTS branch. pg_trickle rewrites this as a UNION ALL:
-- Branch 1: rows qualifying via EXISTS
SELECT ... WHERE EXISTS (...)
UNION ALL
-- Branch 2: rows qualifying via the other condition (minus those already in Branch 1)
SELECT ... WHERE status = 'VIP' AND NOT EXISTS (...)
Each branch is then maintained independently with the rules described above.
IN and NOT IN
IN with a subquery is semantically equivalent to EXISTS:
-- These are equivalent
WHERE customer_id IN (SELECT customer_id FROM vip_list)
WHERE EXISTS (SELECT 1 FROM vip_list WHERE vip_list.customer_id = customers.customer_id)
pg_trickle normalizes IN (subquery) to EXISTS during parsing. The delta rules are the same.
NOT IN is equivalent to NOT EXISTS, with one important difference: NOT IN has tricky NULL semantics. If the subquery returns any NULL, NOT IN returns FALSE for all outer rows. pg_trickle handles this correctly by tracking whether the subquery contains NULLs and short-circuiting the anti-join logic when it does.
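The pitfall is easy to reproduce with plain SQL (standard PostgreSQL behavior, nothing pg_trickle-specific):
-- One NULL in the subquery makes NOT IN reject every outer row:
SELECT 1 WHERE 10 NOT IN (SELECT unnest(ARRAY[1, 2, NULL]));   -- returns no rows
-- NOT EXISTS has no such surprise:
SELECT 1 WHERE NOT EXISTS (
  SELECT 1 FROM unnest(ARRAY[1, 2, NULL]) AS v(x) WHERE x = 10
);                                                              -- returns one row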
Nested EXISTS
EXISTS subqueries can be nested:
SELECT d.department_id
FROM departments d
WHERE EXISTS (
SELECT 1 FROM employees e
WHERE e.department_id = d.department_id
AND EXISTS (
SELECT 1 FROM certifications c
WHERE c.employee_id = e.employee_id
AND c.type = 'security_clearance'
)
);
"Departments that have at least one employee with a security clearance."
pg_trickle processes nested EXISTS by flattening the SubLinks bottom-up:
- Inner EXISTS: certifications → employees (semi-join on employee_id)
- Outer EXISTS: filtered employees → departments (semi-join on department_id)
Each level uses its own reference count and delta-key pre-filtering. A new certification insert triggers a check on the employee, which may trigger a check on the department.
Performance
The cost of maintaining EXISTS/NOT EXISTS is dominated by the reference-count lookup. For each changed key, pg_trickle needs to know the current count of matching rows.
This requires either:
- A maintained count (stored alongside the stream table data), or
- A query against the current source data for the affected keys.
pg_trickle uses the latter — it queries the source tables filtered to the changed keys. This is efficient when the number of changed keys is small (typical case). It can be expensive when a bulk operation affects many keys.
| Scenario | Changed keys | Cost |
|---|---|---|
| 1 new order | 1 customer lookup | <1ms |
| 100 new orders, 30 distinct customers | 30 customer lookups | ~5ms |
| Bulk import: 100K orders, 10K customers | 10K customer lookups | ~200ms |
| FULL refresh (fallback) | All rows | Same as non-IVM |
For the bulk import case, pg_trickle's AUTO mode may decide that FULL refresh is faster than 10,000 individual lookups. The cost model accounts for the number of distinct join keys, not just the number of changed rows.
Summary
EXISTS and NOT EXISTS are maintained incrementally via reference counting on the join key. A match count tracks how many subquery rows qualify for each outer row. When the count crosses the 0/1 boundary, the outer row is added to or removed from the result.
Delta-key pre-filtering ensures the cost is proportional to the number of changed join keys, not the table size. Correlated subqueries, nested EXISTS, and OR conditions are all handled through SubLink extraction and rewrite.
The result: subqueries that would force a full scan in a materialized view are maintained incrementally in a stream table. Write the EXISTS you need. pg_trickle figures out the delta.
← Back to Blog Index | Documentation
Foreign Tables as Stream Table Sources
IVM over data that lives in another database — or in S3
Your data isn't all in one PostgreSQL database. Some of it is in another PostgreSQL instance across the network (postgres_fdw). Some of it is in Parquet files on S3 (parquet_fdw). Some of it comes from a CSV feed (file_fdw).
Can you create a stream table that aggregates across these sources? Yes — with caveats.
pg_trickle supports foreign tables as stream table sources. The CDC mechanism is different (no triggers on foreign tables, so polling-based detection is used), and the performance characteristics are different (every change detection requires a full scan of the foreign table). But it works, and for small-to-medium foreign tables, it works well.
How It Works
Regular source tables use trigger-based CDC: a row-level trigger fires on every DML and writes the change to a buffer table. Foreign tables can't have triggers (in most FDW implementations), so pg_trickle uses a different approach:
Polling-based change detection:
- On each refresh cycle, pg_trickle reads the current contents of the foreign table.
- It compares the current contents with the last known state (using content hashing).
- Rows that are new, changed, or deleted are identified and written to the change buffer.
- The delta query proceeds as normal.
Steps 1 and 2 are the expensive part: reading the foreign table is a full scan over the network, and the comparison touches every row. For a 1-million-row foreign table, this means a full network round-trip and comparison on every refresh cycle.
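Conceptually, the comparison looks something like the sketch below. It is illustrative only: the snapshot table and the per-row md5 hash are assumptions for the example, not pg_trickle's actual internals.
-- Illustrative diff of a foreign table against a locally stored snapshot,
-- keyed on id, with a content hash per row. remote_products_snapshot is a
-- hypothetical table holding the last known state.
WITH cur AS (
  SELECT id, md5(row(name, category, price)::text) AS row_hash
  FROM remote_products
),
prev AS (
  SELECT id, row_hash
  FROM remote_products_snapshot
)
SELECT
  COALESCE(cur.id, prev.id) AS id,
  CASE
    WHEN prev.id IS NULL THEN 'I'   -- appeared on the remote side
    WHEN cur.id  IS NULL THEN 'D'   -- disappeared from the remote side
    ELSE 'U'                        -- content hash changed
  END AS op
FROM cur
FULL OUTER JOIN prev ON prev.id = cur.id
WHERE prev.id IS NULL
   OR cur.id IS NULL
   OR cur.row_hash <> prev.row_hash;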
Setting Up
-- Foreign server to another PostgreSQL instance
CREATE SERVER remote_analytics FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'analytics-db', dbname 'analytics', port '5432');
CREATE USER MAPPING FOR CURRENT_USER SERVER remote_analytics
OPTIONS (user 'reader', password 'secret');
-- Foreign table
CREATE FOREIGN TABLE remote_products (
id INT,
name TEXT,
category TEXT,
price NUMERIC
) SERVER remote_analytics OPTIONS (table_name 'products');
-- Stream table using the foreign table
SELECT pgtrickle.create_stream_table(
name => 'product_summary',
query => $$
SELECT
rp.category,
COUNT(*) AS product_count,
AVG(rp.price) AS avg_price,
MIN(rp.price) AS min_price
FROM remote_products rp
GROUP BY rp.category
$$,
schedule => '30s',
cdc_mode => 'trigger' -- only trigger mode works with FDW
);
Note: You must use cdc_mode => 'trigger' (or 'auto' which starts with triggers). WAL-based CDC doesn't work with foreign tables because foreign table DML doesn't produce local WAL records.
pg_trickle detects that remote_products is a foreign table and automatically switches to polling-based change detection for that source.
Mixed Sources: Foreign + Local
Stream tables can reference both foreign and local tables:
SELECT pgtrickle.create_stream_table(
name => 'order_product_summary',
query => $$
SELECT
o.customer_id,
rp.category,
SUM(o.total) AS total_spent,
COUNT(*) AS order_count
FROM orders o
JOIN remote_products rp ON rp.id = o.product_id
GROUP BY o.customer_id, rp.category
$$,
schedule => '10s'
);
Here:
- orders is a local table → trigger-based CDC (fast, per-row).
- remote_products is foreign → polling-based detection (full scan).
When orders changes, the delta is computed using only the changed orders (trigger CDC). When remote_products changes, pg_trickle detects the change via polling and recomputes the affected groups.
The practical effect: changes to local tables are reflected in 10 seconds (the schedule). Changes to foreign tables are also reflected in 10 seconds, but each check requires a full scan of the foreign table.
File-Based FDWs
file_fdw reads from CSV/TSV files on the server's filesystem:
CREATE FOREIGN TABLE exchange_rates (
currency TEXT,
rate NUMERIC,
effective_date DATE
) SERVER file_server OPTIONS (filename '/data/exchange_rates.csv', format 'csv');
SELECT pgtrickle.create_stream_table(
name => 'revenue_in_usd',
query => $$
SELECT
o.customer_id,
SUM(o.total * er.rate) AS revenue_usd
FROM orders o
JOIN exchange_rates er ON er.currency = o.currency
AND er.effective_date = CURRENT_DATE
GROUP BY o.customer_id
$$,
schedule => '1m'
);
When the CSV file is updated (new exchange rates), pg_trickle detects the change on the next polling cycle and recomputes the affected aggregates.
Caveat: file_fdw doesn't support transactions. If the file is updated while pg_trickle is reading it, you might get inconsistent results. Use atomic file replacement (write to a temp file, then mv) to avoid this.
Parquet and S3
With parquet_fdw or parquet_s3_fdw:
CREATE FOREIGN TABLE s3_events (
event_id BIGINT,
event_type TEXT,
payload JSONB,
created_at TIMESTAMP
) SERVER parquet_s3 OPTIONS (
filename 's3://my-bucket/events/*.parquet'
);
SELECT pgtrickle.create_stream_table(
name => 'event_type_counts',
query => $$
SELECT event_type, COUNT(*) AS cnt
FROM s3_events
GROUP BY event_type
$$,
schedule => '5m'
);
This brings S3 data into pg_trickle's IVM pipeline. The full scan of S3 data happens every 5 minutes (the schedule), and only changes are propagated to the stream table.
Performance note: S3 reads are slow compared to local tables. Set a longer schedule (minutes, not seconds) to amortize the scan cost.
Performance Characteristics
| Source Type | CDC Method | Per-Cycle Cost | Best Schedule |
|---|---|---|---|
| Local table (with PK) | Trigger | O(delta) | 1s–10s |
| Local table (no PK) | Trigger + content hash | O(delta) | 1s–10s |
| Foreign table (postgres_fdw) | Polling | O(foreign table size) | 10s–5m |
| Foreign table (file_fdw) | Polling | O(file size) | 1m–1h |
| Foreign table (parquet_fdw/S3) | Polling | O(S3 read latency + data size) | 5m–1h |
The fundamental limitation: foreign table change detection requires a full scan because there's no trigger or WAL mechanism to capture individual changes. The schedule should be set long enough that the scan cost is acceptable.
Optimization: Materialize First
For large foreign tables or frequent refreshes, consider materializing the foreign data into a local table first:
-- Local copy of foreign data
CREATE TABLE local_products AS SELECT * FROM remote_products;
-- Periodic sync (pg_cron)
SELECT cron.schedule('sync-products', '*/5 * * * *', $$
TRUNCATE local_products;
INSERT INTO local_products SELECT * FROM remote_products;
$$);
-- Stream table uses local copy (trigger CDC, fast)
SELECT pgtrickle.create_stream_table(
name => 'product_summary',
query => $$ SELECT ... FROM local_products ... $$,
schedule => '5s'
);
This separates the foreign-table sync (every 5 minutes, full scan) from the stream table refresh (every 5 seconds, trigger-based delta). The stream table gets fast incremental maintenance; the foreign data sync happens on a longer cadence.
Summary
Foreign tables work as stream table sources with polling-based change detection. Every refresh cycle requires a full scan of the foreign table to detect changes. This is inherently slower than trigger-based CDC but enables IVM over data that lives outside PostgreSQL.
Use foreign tables directly when:
- The foreign table is small (<100K rows)
- The refresh schedule is long enough to amortize the scan cost
- Simplicity matters more than optimal performance
Materialize into a local table first when:
- The foreign table is large
- You need sub-second refresh latency
- The foreign table changes infrequently relative to local tables
Either way, pg_trickle handles the rest. Foreign or local, the delta rules are the same.
← Back to Blog Index | Documentation
Funnel Analysis and Cohort Retention at Scale
Computing conversion funnels, retention matrices, and session aggregates incrementally — keeping product analytics live
Product analytics is dominated by two questions: "where do users drop off?" (funnels) and "do users come back?" (retention). These questions drive product decisions worth millions of dollars. They're also among the most expensive queries to compute at scale.
A conversion funnel scans every user's event history to determine how far they progressed through a sequence of steps. A retention matrix examines every user's activity across multiple time periods. For a product with 10 million monthly active users generating 500 million events per month, these queries read hundreds of millions of rows and take minutes to complete.
And yet the answers change slowly. Of those 500 million events, only the ones generated in the last few minutes are new. The funnel for users who signed up months ago hasn't changed — those users' journeys are long since complete. January's week-one retention was fixed by the end of February. Only the current period's numbers are actively evolving.
pg_trickle exploits this property. By maintaining funnel and retention analytics as stream tables, only new events trigger updates. Historical cohorts are untouched. The live analytics dashboard stays current within seconds, even as the total event volume reaches billions.
The Classic Funnel Query
A typical e-commerce funnel tracks users through: Visit → Product View → Add to Cart → Checkout → Purchase. The SQL looks like:
SELECT
date_trunc('week', first_visit) AS cohort_week,
COUNT(DISTINCT user_id) AS visitors,
COUNT(DISTINCT user_id) FILTER (WHERE viewed_product) AS product_viewers,
COUNT(DISTINCT user_id) FILTER (WHERE added_to_cart) AS cart_adders,
COUNT(DISTINCT user_id) FILTER (WHERE started_checkout) AS checkout_starters,
COUNT(DISTINCT user_id) FILTER (WHERE completed_purchase) AS purchasers
FROM (
SELECT
user_id,
MIN(created_at) FILTER (WHERE event_type = 'page_visit') AS first_visit,
bool_or(event_type = 'product_view') AS viewed_product,
bool_or(event_type = 'add_to_cart') AS added_to_cart,
bool_or(event_type = 'checkout_start') AS started_checkout,
bool_or(event_type = 'purchase') AS completed_purchase
FROM events
GROUP BY user_id
) user_journeys
GROUP BY date_trunc('week', first_visit);
This query reads the entire events table, computes per-user journey completions, and aggregates by cohort. For 500 million events, it's a multi-minute query. Run it every time a product manager opens the dashboard and you're spending significant database resources on repetitive computation.
Funnel as a Stream Table
Break it into two layers — user journey state and cohort aggregation:
-- Layer 1: Per-user funnel progression
SELECT pgtrickle.create_stream_table(
'user_funnel_state',
$$
SELECT
user_id,
MIN(created_at) FILTER (WHERE event_type = 'page_visit') AS first_visit,
bool_or(event_type = 'page_visit') AS visited,
bool_or(event_type = 'product_view') AS viewed_product,
bool_or(event_type = 'add_to_cart') AS added_to_cart,
bool_or(event_type = 'checkout_start') AS started_checkout,
bool_or(event_type = 'purchase') AS completed_purchase
FROM events
GROUP BY user_id
$$
);
When a user triggers an add_to_cart event, only that user's row in user_funnel_state is updated. The added_to_cart flag flips from false to true. The other 10 million users' states are untouched.
-- Layer 2: Cohort aggregation
SELECT pgtrickle.create_stream_table(
'weekly_funnel',
$$
SELECT
date_trunc('week', first_visit) AS cohort_week,
COUNT(*) AS total_users,
COUNT(*) FILTER (WHERE viewed_product) AS product_viewers,
COUNT(*) FILTER (WHERE added_to_cart) AS cart_adders,
COUNT(*) FILTER (WHERE started_checkout) AS checkout_starters,
COUNT(*) FILTER (WHERE completed_purchase) AS purchasers
FROM user_funnel_state
WHERE first_visit IS NOT NULL
GROUP BY date_trunc('week', first_visit)
$$
);
The cohort aggregation reads from the user funnel state (not from raw events). When one user's state changes, only their cohort's counts are adjusted. If a user from the March 15 cohort completes a purchase, the purchasers count for that week increments by 1. All other weeks are untouched.
The cascade processes: one new event → one user state update → one cohort count adjustment. Total rows processed: 3, regardless of whether you have 1 million or 1 billion historical events.
Retention Matrices
Retention analysis asks: of the users who signed up in week W, what fraction were active in week W+1, W+2, W+3, etc.?
-- User-week activity matrix
SELECT pgtrickle.create_stream_table(
'user_weekly_activity',
$$
SELECT
user_id,
-- first active week across all of the user's events, i.e. the signup week
MIN(date_trunc('week', created_at)) OVER (PARTITION BY user_id) AS signup_week,
date_trunc('week', created_at) AS active_week,
COUNT(*) AS event_count
FROM events
GROUP BY user_id, date_trunc('week', created_at)
$$
);
This stream table maintains one row per user per active week. When a user generates events in a new week, a new row appears. When they generate more events in the same week, the count increments.
The retention matrix is then:
SELECT pgtrickle.create_stream_table(
'retention_matrix',
$$
SELECT
signup_week,
(EXTRACT(EPOCH FROM active_week - signup_week) / 604800)::integer AS weeks_since_signup,
COUNT(DISTINCT user_id) AS active_users
FROM user_weekly_activity
GROUP BY signup_week, (EXTRACT(EPOCH FROM active_week - signup_week) / 604800)::integer
$$
);
Each cell in the retention matrix — "signup week X, active in week X+N" — is a distinct count maintained incrementally. When a user from the January cohort is active in their 8th week, only the (January, +8) cell increments. The other hundreds of cells in the matrix are untouched.
This is dramatic efficiency for mature products. A product with 2 years of weekly cohorts has 104 × 104 = 10,816 cells in the full retention matrix. But in any given week, only the "current week" column changes (existing users becoming active this week). That's at most 104 cell updates. The other 10,712 cells represent historical retention that will never change again.
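Turning those raw counts into the familiar percentage heatmap is an ordinary query over the stream table. A sketch, assuming week 0 of each cohort represents the cohort size:
-- Retention percentages derived from the incrementally maintained matrix.
SELECT
  r.signup_week,
  r.weeks_since_signup,
  r.active_users,
  ROUND(100.0 * r.active_users / NULLIF(c.active_users, 0), 1) AS retention_pct
FROM retention_matrix r
JOIN retention_matrix c
  ON c.signup_week = r.signup_week
 AND c.weeks_since_signup = 0
ORDER BY r.signup_week, r.weeks_since_signup;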
Session-Based Funnels
Some funnels are per-session rather than per-user lifetime. "In a single session, how many users go from landing page to signup?"
-- Session boundaries (30-minute gap = new session)
SELECT pgtrickle.create_stream_table(
'session_funnels',
$$
WITH sessions AS (
SELECT
user_id,
created_at,
event_type,
SUM(CASE WHEN created_at - lag_ts > interval '30 minutes' THEN 1 ELSE 0 END)
OVER (PARTITION BY user_id ORDER BY created_at) AS session_id
FROM (
SELECT *,
LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS lag_ts
FROM events
) e
)
SELECT
date_trunc('day', MIN(created_at)) AS day,
user_id,
session_id,
bool_or(event_type = 'landing_page') AS saw_landing,
bool_or(event_type = 'signup_form') AS saw_signup_form,
bool_or(event_type = 'signup_complete') AS completed_signup
FROM sessions
GROUP BY user_id, session_id
$$
);
Each new event is assigned to a session (based on the 30-minute gap heuristic) and the session's funnel state is updated. Sessions that ended hours ago are never re-examined.
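The per-session state then rolls up into a daily landing-to-signup conversion rate with an ordinary query (shown here as an ad-hoc read; it could equally be cascaded into another stream table):
-- Daily landing-page to signup conversion from the session_funnels projection.
SELECT
  day,
  COUNT(*) FILTER (WHERE saw_landing) AS landing_sessions,
  COUNT(*) FILTER (WHERE saw_landing AND completed_signup) AS converted_sessions,
  ROUND(
    COUNT(*) FILTER (WHERE saw_landing AND completed_signup)::numeric
      / NULLIF(COUNT(*) FILTER (WHERE saw_landing), 0),
    3
  ) AS landing_to_signup_rate
FROM session_funnels
GROUP BY day
ORDER BY day;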
Conversion Rate Over Time
The most actionable metric is often the conversion rate trend — is it improving or declining?
SELECT pgtrickle.create_stream_table(
'daily_conversion_rates',
$$
SELECT
date_trunc('day', first_visit) AS day,
COUNT(*) AS total_visitors,
COUNT(*) FILTER (WHERE completed_purchase) AS purchasers,
COUNT(*) FILTER (WHERE completed_purchase)::float / NULLIF(COUNT(*), 0) AS conversion_rate
FROM user_funnel_state
WHERE first_visit IS NOT NULL
GROUP BY date_trunc('day', first_visit)
$$
);
This gives you a live conversion rate per day, updated incrementally. When a user who visited today completes a purchase (possibly hours later), today's conversion rate ticks up. Product managers can watch the conversion rate move in real time during an A/B test launch, without waiting for a nightly analytics rebuild.
Segmented Funnels
Real product analytics always slices by dimensions — device type, acquisition channel, geographic region, user plan:
SELECT pgtrickle.create_stream_table(
'funnel_by_channel',
$$
SELECT
u.acquisition_channel,
date_trunc('week', ufs.first_visit) AS cohort_week,
COUNT(*) AS visitors,
COUNT(*) FILTER (WHERE ufs.viewed_product) AS viewers,
COUNT(*) FILTER (WHERE ufs.completed_purchase) AS purchasers
FROM user_funnel_state ufs
JOIN users u ON u.id = ufs.user_id
GROUP BY u.acquisition_channel, date_trunc('week', ufs.first_visit)
$$
);
The join with the users table brings in segmentation dimensions. When a user's funnel state changes, their segment's counts are adjusted. When a user's segment changes (e.g., they upgrade from "free" to "paid"), both the old and new segment's counts are adjusted.
Replacing Amplitude / Mixpanel / PostHog Analytics
Product analytics SaaS tools charge per event volume. At scale (hundreds of millions of events per month), this becomes expensive. More importantly, your data lives in a third-party system — you can't join it with operational data, run arbitrary queries, or maintain custom metrics.
With pg_trickle, product analytics is just SQL:
| Feature | SaaS analytics | pg_trickle stream tables |
|---|---|---|
| Funnel computation | Proprietary query engine | Standard SQL (GROUP BY, FILTER) |
| Retention matrices | Pre-built visualization | SQL query → dashboard tool |
| Real-time updates | Minutes of delay | Seconds (refresh interval) |
| Custom metrics | Limited to tool's model | Arbitrary SQL |
| Data residency | Vendor's cloud | Your PostgreSQL instance |
| Cost model | Per event ($$$) | Per compute (database cost) |
| Joins with business data | Export/import | Direct JOIN in same database |
The trade-off is visualization. SaaS tools provide beautiful funnel charts and retention heatmaps out of the box. With pg_trickle, you connect a visualization tool (Grafana, Metabase, Superset) to the stream tables and build your own dashboards. The data is always fresh — the visualization layer just reads from pre-computed tables.
Performance at Scale
For a product with 10M MAU and 500M events/month:
| Query | Full computation | Incremental update |
|---|---|---|
| Weekly funnel (all users) | 45s | 3ms (per event batch) |
| Retention matrix (104 weeks) | 2.5min | 1ms (per active user) |
| Daily conversion rate | 12s | <1ms (per user state change) |
The incremental cost is per-event or per-user-state-change — constant regardless of historical data volume. Your product analytics stay responsive as your user base grows from 100K to 10M, without upgrading your analytics infrastructure.
Stop computing funnels from scratch. Let the differential engine maintain your product analytics incrementally — fresh conversion rates, live retention matrices, and instant cohort insights without billion-row scans.
← Back to Blog Index | Documentation
GROUPING SETS, ROLLUP, and CUBE — Incrementally
Multi-dimensional aggregation maintained by delta, not by full scan
GROUPING SETS, ROLLUP, and CUBE are PostgreSQL's multi-dimensional aggregation features. They let you compute subtotals, grand totals, and cross-tabulations in a single query. They're also the features most likely to make your DBA wince when you put them in a materialized view, because they multiply the work of an already-expensive GROUP BY.
pg_trickle maintains them incrementally. The trick is automatic decomposition: a single CUBE query is rewritten into multiple UNION ALL branches, one per grouping level. Each branch is maintained as an independent delta.
Quick Refresher
If you're familiar with grouping sets, skip to the next section. If not, here's the 30-second version:
-- ROLLUP: subtotals for each prefix of the grouping columns
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY ROLLUP(region, product);
This produces:
- Per-region, per-product totals
- Per-region subtotals (product = NULL)
- Grand total (region = NULL, product = NULL)
CUBE is the power set — every combination of grouping columns:
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY CUBE(region, product);
This adds:
- Per-product subtotals (region = NULL)
GROUPING SETS lets you pick exactly which combinations:
SELECT region, product, channel, SUM(revenue)
FROM sales
GROUP BY GROUPING SETS (
(region, product),
(region, channel),
(region),
()
);
The Problem With Full Refresh
A GROUP BY CUBE(a, b, c) over three columns produces $2^3 = 8$ grouping levels. Over four columns: 16. Over five: 32. Each level is a separate aggregation pass.
For a table with 10 million rows, CUBE(a, b, c) effectively runs 8 separate GROUP BY queries, each scanning 10 million rows. If you're refreshing this as a materialized view every 5 seconds, you're scanning 80 million rows every 5 seconds.
With IVM, 10 new rows inserted means updating at most 10 groups per grouping level — roughly 80 group updates across the 8 levels instead of 80 million row scans.
The Rewrite
pg_trickle's query parser detects ROLLUP, CUBE, and GROUPING SETS and rewrites them into a UNION ALL of standard GROUP BY queries. This happens at stream table creation time — the defining query is normalized before the differential engine sees it.
For example:
-- What you write
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY ROLLUP(region, product);
-- What pg_trickle sees internally
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY region, product
UNION ALL
SELECT region, NULL::text AS product, SUM(revenue)
FROM sales
GROUP BY region
UNION ALL
SELECT NULL::text AS region, NULL::text AS product, SUM(revenue)
FROM sales;
Each branch of the UNION ALL is a standard GROUP BY that pg_trickle knows how to maintain incrementally. The delta rules for SUM, COUNT, and other algebraic aggregates apply directly.
Creating a Stream Table
SELECT pgtrickle.create_stream_table(
name => 'sales_cube',
query => $$
SELECT
region,
product_category,
date_trunc('month', sale_date) AS month,
SUM(revenue) AS total_revenue,
COUNT(*) AS num_sales,
AVG(revenue) AS avg_revenue
FROM sales
GROUP BY CUBE(region, product_category, date_trunc('month', sale_date))
$$,
schedule => '10s'
);
Internally, this is decomposed into 8 UNION ALL branches (one for each subset of {region, product_category, month}). Each branch is maintained independently.
When a new sale is recorded, pg_trickle:
- Identifies the affected groups in each branch (e.g., region="Northeast", product="Electronics", month="2026-04").
- Applies the algebraic delta: new_sum = old_sum + revenue, new_count = old_count + 1.
- Updates only those groups, in all 8 branches.
The GROUPING() Function
PostgreSQL's GROUPING() function distinguishes actual NULL values from NULL used as a "grand total" marker:
SELECT
region,
product,
GROUPING(region) AS is_region_total,
GROUPING(product) AS is_product_total,
SUM(revenue)
FROM sales
GROUP BY CUBE(region, product);
pg_trickle preserves this in the rewrite. Each UNION ALL branch sets the appropriate GROUPING() bits as constants:
-- Branch for per-region subtotals
SELECT region, NULL::text AS product,
0 AS is_region_total, -- region is a real group
1 AS is_product_total, -- product is rolled up
SUM(revenue)
FROM sales
GROUP BY region;
The GROUPING() values are deterministic per branch, so they don't need delta computation — they're constant columns.
Drill-Down Dashboards
The classic use case for CUBE/ROLLUP is drill-down analytics. A dashboard shows:
- Grand total (all regions, all products, all months)
- User clicks a region → subtotals by product and month for that region
- User clicks a product → per-month detail for that region+product
With a traditional materialized view, each drill level requires a query against the base table or a separate materialized view per level.
With a CUBE stream table, all levels are precomputed and maintained incrementally in a single table:
-- Grand total
SELECT total_revenue FROM sales_cube
WHERE region IS NULL AND product_category IS NULL AND month IS NULL;
-- Region subtotals
SELECT product_category, month, total_revenue FROM sales_cube
WHERE region = 'Northeast'
AND product_category IS NOT NULL
AND month IS NOT NULL;
The query hits a small, precomputed table instead of scanning millions of rows. And it's always fresh — within schedule seconds of reality.
ROLLUP for Hierarchical Totals
ROLLUP is specifically designed for hierarchical aggregation. If your grouping columns have a natural hierarchy (year → quarter → month → day), ROLLUP produces exactly the subtotals you need:
SELECT pgtrickle.create_stream_table(
name => 'revenue_hierarchy',
query => $$
SELECT
date_trunc('year', sale_date) AS year,
date_trunc('quarter', sale_date) AS quarter,
date_trunc('month', sale_date) AS month,
SUM(revenue) AS total,
COUNT(*) AS num_sales
FROM sales
GROUP BY ROLLUP(
date_trunc('year', sale_date),
date_trunc('quarter', sale_date),
date_trunc('month', sale_date)
)
$$,
schedule => '5s'
);
This produces 4 levels:
- Per year + quarter + month
- Per year + quarter
- Per year
- Grand total
For a new sale in April 2026, pg_trickle updates 4 groups: (2026, Q2, April), (2026, Q2), (2026), and the grand total. Four group updates regardless of table size.
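Reading a particular level back out is then a simple filter on the rollup markers. A sketch, assuming sale_date is never NULL so a NULL in these columns can only mean "rolled up" (otherwise use the GROUPING() flags shown earlier):
-- Quarterly subtotals: month rolled up, quarter and year still present.
SELECT year, quarter, total, num_sales
FROM revenue_hierarchy
WHERE quarter IS NOT NULL AND month IS NULL
ORDER BY quarter;

-- Grand total: every level rolled up.
SELECT total, num_sales
FROM revenue_hierarchy
WHERE year IS NULL;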
Custom GROUPING SETS
When CUBE or ROLLUP generates too many combinations, use explicit GROUPING SETS:
SELECT pgtrickle.create_stream_table(
name => 'targeted_summary',
query => $$
SELECT
region,
channel,
product_category,
SUM(revenue) AS total,
COUNT(*) AS cnt
FROM sales
GROUP BY GROUPING SETS (
(region, product_category),
(channel, product_category),
(region),
()
)
$$,
schedule => '10s'
);
This produces exactly 4 grouping levels — not the 8 that CUBE would produce. Each is maintained as a separate UNION ALL branch with independent delta rules.
Performance Characteristics
The decomposition into UNION ALL branches means the number of "virtual queries" grows with the number of grouping sets. For CUBE over N columns, that's $2^N$ branches.
| Columns | CUBE branches | ROLLUP branches |
|---|---|---|
| 2 | 4 | 3 |
| 3 | 8 | 4 |
| 4 | 16 | 5 |
| 5 | 32 | 6 |
Each branch has its own delta computation. The per-branch cost is small (proportional to the number of changed groups), but the constant factor matters when you have 32 branches.
Practical limit: CUBE over 5+ columns works but produces a lot of output rows. If you don't need all $2^N$ combinations, use explicit GROUPING SETS to include only the levels you actually query.
Summary
GROUPING SETS, ROLLUP, and CUBE are automatically decomposed into UNION ALL branches, each maintained incrementally using standard algebraic delta rules.
The result: drill-down dashboards, hierarchical totals, and cross-tabulations that update in milliseconds instead of seconds. The precomputed table contains every grouping level you need, always fresh, always queryable.
Use ROLLUP for hierarchical data. Use CUBE when you need every combination. Use GROUPING SETS when you need exactly the levels you query. And let pg_trickle handle the math.
← Back to Blog Index | Documentation
High Availability Failover with pg_trickle and Patroni
How stream table state and change buffers survive a primary switchover, and what it takes to achieve zero data loss
Running pg_trickle in production means running it in a high-availability cluster. Nobody deploys a single PostgreSQL instance for critical workloads anymore — you have a primary, one or more standbys, and an HA controller (Patroni, Stolon, pg_auto_failover, or CloudNativePG) that handles automatic failover. The question is: what happens to your stream tables when the primary fails and a standby is promoted?
The short answer is: everything works. pg_trickle's state lives entirely in regular PostgreSQL tables (the catalog, change buffers, and materialized stream table data). All of this is replicated to standbys via standard WAL streaming. When a standby is promoted, it has a complete, consistent copy of all stream table state as of the last replicated WAL position.
The longer answer involves understanding the failure modes, the recovery semantics, and the configuration choices that determine whether you get zero data loss or merely very low data loss.
What State Does pg_trickle Maintain?
pg_trickle stores all its state in PostgreSQL tables within the pgtrickle and pgtrickle_changes schemas:
- Catalog tables (pgtrickle.pgt_stream_tables, etc.) — metadata about stream table definitions, refresh modes, DAG relationships
- Change buffer tables (pgtrickle_changes.changes_<oid>) — pending row changes captured by CDC triggers since the last refresh
- Materialized data — the stream table's result set, stored as a regular heap table
- Shared memory state — scheduler bookkeeping, refresh counters, lock states
Items 1–3 are durable, WAL-logged, and replicated. Item 4 is in shared memory and is reconstructed on startup from the durable state.
Synchronous vs. Asynchronous Replication
The data loss characteristics during failover depend on your replication mode:
Synchronous replication (synchronous_commit = on with a sync standby): The primary waits for the standby to acknowledge each transaction before committing. If the primary dies, the standby has every committed transaction. Zero data loss is guaranteed.
Asynchronous replication (the default): The primary commits immediately and streams WAL to the standby asynchronously. If the primary dies, the standby might be a few transactions behind. Those in-flight transactions are lost.
For pg_trickle, this means:
-
Synchronous mode: After failover, the promoted standby has all committed source table changes and all committed stream table states. The change buffers accurately reflect "changes since last refresh." Resuming the scheduler produces correct results.
-
Asynchronous mode: After failover, some recent source table changes might be lost (they were committed on the old primary but not yet replicated). This is the same data loss that affects all tables — it's not specific to pg_trickle. Stream tables might show slightly stale results (they reflect a state a few transactions behind), but they'll catch up on the next refresh.
Patroni Failover Sequence
When Patroni detects the primary is unhealthy and initiates failover:
- Fencing: The old primary is isolated (network fence, pg_ctl stop, or shutdown)
- Promotion: The most up-to-date standby is promoted with pg_ctl promote
- Reconnection: Clients are redirected to the new primary (via HAProxy, DNS, or Patroni's REST API)
- Timeline advance: The promoted standby starts a new WAL timeline
From pg_trickle's perspective:
-
Step 1–2: The background worker on the old primary is killed (either by
SIGTERM from Patroni or by the postmaster shutdown). Any in-flight refresh is aborted. This is safe — refreshes are transactional, so an interrupted refresh rolls back cleanly.
Step 2 (on new primary): PostgreSQL starts the
pg_tricklebackground worker as part of the promoted standby's startup sequence. The worker reads its state from the catalog tables (which are now read-write) and resumes scheduling. -
Step 3: Client applications reconnect and resume writing to source tables. CDC triggers on the new primary capture changes into change buffer tables.
-
Step 4: The DAG scheduler picks up where it left off — processing pending changes in the buffers and refreshing stream tables according to their configured schedule.
The Refresh-in-Progress Problem
What if a refresh was halfway through when the primary crashed? The refresh involves:
- Reading change buffers
- Computing deltas
- Applying deltas to the stream table
- Truncating processed change buffers
All of this happens within a single transaction. If the primary crashes at any point during this transaction, the transaction rolls back. On the new primary:
- The change buffer still contains all unprocessed changes (the truncation never committed)
- The stream table still reflects the pre-refresh state (the delta application never committed)
- The next refresh processes the same changes successfully
This is the beauty of transactional refresh: crash recovery is automatic and correct. No manual intervention, no reconciliation, no "replaying from checkpoint."
Shared Memory Reconstruction
pg_trickle uses shared memory for scheduler state: which stream tables need refresh, when they were last refreshed, how long the last refresh took. This state is not persisted to disk — it lives only in shared memory.
When the new primary starts the pg_trickle background worker, it reconstructs shared memory state from the catalog:
- Reads all stream table definitions from pgtrickle.pgt_stream_tables
- Reads dependency information to rebuild the DAG
- Checks change buffer tables for pending changes
- Initializes refresh timestamps to "never" (forcing a refresh check on the next cycle)
The first refresh cycle after failover might refresh more stream tables than strictly necessary (because it doesn't know how recently each was refreshed on the old primary). This is harmless — a redundant refresh produces correct results, just with slightly more work.
Split-Brain Prevention
The most dangerous HA scenario is split-brain: both the old primary and new primary accept writes simultaneously. This can cause divergent change buffers and inconsistent stream table state.
Patroni prevents split-brain through fencing — the old primary is forcefully stopped before the new primary is promoted. But fencing can fail. To protect against split-brain at the pg_trickle level:
-
pg_trickle.enabled GUC: Set to off on standbys. The background worker checks this on startup and does nothing if disabled. Only the promoted primary (where Patroni sets it to on) runs the scheduler.
Advisory locks: The scheduler acquires a cluster-wide advisory lock before performing refreshes. If a stale primary somehow continues running, its lock acquisition will fail (or it will hold a lock that's irrelevant since clients have moved to the new primary).
-
Timeline-aware buffers: Change buffer entries include the WAL timeline. After a timeline fork, entries from the wrong timeline can be identified and discarded.
Configuring for Zero Data Loss
For workloads where stream table consistency is critical (financial analytics, billing aggregations):
# postgresql.conf on primary
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
synchronous_commit = on
# pg_trickle specific
pg_trickle.enabled = on
pg_trickle.refresh_on_promote = true # force immediate refresh after failover
With synchronous replication, the promoted standby is guaranteed to have all committed data. The refresh_on_promote setting triggers an immediate refresh cycle on promotion, ensuring stream tables are current before client connections arrive.
For workloads where sub-second staleness is acceptable:
# Asynchronous replication (lower latency, higher throughput)
synchronous_commit = off
# pg_trickle will catch up after failover
pg_trickle.enabled = on
pg_trickle.scheduler_interval = '1s'
After failover, stream tables might be 1–2 seconds stale. The scheduler catches up within one interval, and the system is fully current. No manual intervention.
Testing Failover
Validate your HA configuration with deliberate failover testing:
# Simulate primary failure
patronictl failover --candidate standby1 --force
# Verify pg_trickle is running on new primary
psql -h new-primary -c "SELECT pgtrickle.version();"
psql -h new-primary -c "SELECT * FROM pgtrickle.stream_table_status();"
# Verify stream tables are being refreshed
psql -h new-primary -c "
INSERT INTO source_table (value) VALUES ('after-failover');
SELECT pgtrickle.refresh_stream_table('my_stream');
SELECT * FROM my_stream WHERE value = 'after-failover';
"
The test should confirm:
- The pg_trickle background worker is running on the new primary
- Stream table metadata is intact
- CDC triggers are active on the new primary
- Incremental refresh processes new changes correctly
CloudNativePG and Kubernetes
For Kubernetes deployments using CloudNativePG (CNPG), the considerations are similar but the mechanisms differ:
- CNPG manages failover via Pod deletion and promotion
- The pg_trickle shared library is included in the container image
- GUCs are set via the
Clustercustom resource - Failover is typically faster (seconds) due to the container orchestration model
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: analytics-cluster
spec:
instances: 3
postgresql:
parameters:
shared_preload_libraries: "pg_trickle"
pg_trickle.enabled: "on"
pg_trickle.scheduler_interval: "2s"
storage:
size: 100Gi
CNPG ensures that pg_trickle.enabled = on takes effect only on the primary instance (standbys are read-only by definition). After a failover, the new primary's background worker starts automatically.
Monitoring Failover Health
After a failover event, verify pg_trickle's health:
-- Check scheduler is running
SELECT * FROM pgtrickle.scheduler_status();
-- Check for stale stream tables (last_refreshed should be recent)
SELECT name, last_refreshed_at, refresh_mode
FROM pgtrickle.stream_table_status()
WHERE last_refreshed_at < now() - interval '1 minute'
ORDER BY last_refreshed_at;
-- Check change buffer sizes (should be draining, not growing)
SELECT relname, n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'pgtrickle_changes'
ORDER BY n_live_tup DESC;
If change buffers are growing and stream tables aren't refreshing, the scheduler might not have started. Check the PostgreSQL log for background worker startup messages and verify the pg_trickle.enabled GUC is on.
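Two quick sanity checks cover most of that; a sketch, with the caveat that the exact backend_type label the worker registers under is an assumption and may vary by version:
-- Is the extension enabled on this (promoted) node?
SHOW pg_trickle.enabled;

-- Is a pg_trickle background worker registered with the postmaster?
SELECT pid, backend_type, state
FROM pg_stat_activity
WHERE backend_type ILIKE '%trickle%';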
pg_trickle survives failover because its state is just PostgreSQL tables — replicated, durable, and transactional. Configure synchronous replication for zero data loss, or accept asynchronous with sub-second catch-up. Either way, your stream tables resume without manual intervention.
← Back to Blog Index | Documentation
HNSW Recall Is a Lie
How Distribution Drift Silently Breaks Similarity Search (and What to Do About It)
You built a similarity search feature. You measured recall at launch: 94% — excellent. You ship it, add it to the demo, put it on the homepage.
Six months later, a user complains that the results feel "off." You run the same recall measurement: 71%. You've been serving degraded results for months and had no idea.
This is distribution drift. It's real, it's common, and it's almost never discussed in tutorials. Here's what it is, why it happens, how to measure it, and what pg_trickle's drift-aware reindex policy does about it.
What IVFFlat Actually Does (and Why That's a Problem)
IVFFlat (Inverted File with Flat quantization) is a classical approximate nearest-neighbor algorithm. It works in two phases:
Build phase: Run k-means clustering on your vectors, dividing them into lists groups (one per centroid), where lists is the index build parameter. Each vector is assigned to its nearest centroid. The index stores, for each centroid, the list of vectors in that group.
Query phase: Given a query vector, find the probes nearest centroids. Search only the vectors assigned to those centroids.
The speed comes from searching only a probes / lists fraction of your data. The recall comes from the assumption that your query vector's nearest neighbors are in the nearest centroids.
The problem: The centroid assignments are computed from the data distribution at build time. As your data distribution changes, the centroid assignments become stale.
Imagine you're building a product recommendation system in January. Your product catalog skews toward winter items: coats, heaters, boots. You build the IVFFlat index. The centroids reflect a winter distribution.
By June, you've added thousands of summer items: swimwear, grills, sunscreen. These items get assigned to the nearest existing centroid at insertion time — but those centroids were trained on winter data. Your summer items end up crowded into a handful of winter centroids that happen to be geometrically nearest.
Now when someone searches for "outdoor summer activities," the query vector points toward the summer region of the embedding space. The two nearest centroids happen to be the winter outdoor centroid and the spring gardening centroid (geometrically closest to the summer space). The actual summer items are scattered across three other centroids that were not searched. Recall degrades.
The degradation is gradual and silent. Each new insert makes it slightly worse. Users experience it as "the search feels off" before you ever measure it.
HNSW: The Tombstone Problem
HNSW (Hierarchical Navigable Small World graph) has a different problem.
HNSW is a graph-based index. Each vector is a node. Edges connect nearby nodes across multiple layers, enabling logarithmic-time traversal. Unlike IVFFlat, HNSW doesn't need retraining as data changes — new nodes are inserted by connecting them into the existing graph.
But HNSW doesn't actually delete nodes. When you delete a vector, the node is marked as dead (a tombstone) but its edges remain in the graph. Graph traversal still has to navigate through and around tombstones.
The consequences:
Build slowdown: As tombstones accumulate, inserting new nodes requires connecting through more dead nodes. Insertion time increases.
Query slowdown: Traversal visits tombstones and has to backtrack more often. Query latency increases.
Index size: The index doesn't shrink when rows are deleted. Tombstones consume space.
The pgvector documentation addresses this: "Vacuuming can take a while for HNSW indexes. Speed it up by reindexing first." But this is reactive advice — the documentation doesn't tell you when to reindex, how to measure when you need to, or how to automate it.
The answer for most teams is a scheduled REINDEX CONCURRENTLY — weekly, or monthly, or "when someone notices it's slow." This is a blunt instrument.
Measuring Distribution Drift
The right metric for IVFFlat drift is recall — the fraction of true nearest neighbors returned by approximate search.
The standard measurement procedure:
-- 1. Sample query vectors
SELECT id, embedding FROM items ORDER BY random() LIMIT 1000;
-- 2. For each sample, run exact search (seq scan)
-- and approximate search (index scan), compare results
BEGIN;
SET LOCAL enable_indexscan = off; -- force exact search
SELECT id, embedding <=> $query AS distance
FROM items ORDER BY distance LIMIT 10;
COMMIT;
-- vs.
SELECT id, embedding <=> $query AS distance
FROM items ORDER BY distance LIMIT 10; -- uses index
-- Recall = |exact_results ∩ approx_results| / |exact_results|
Running this for 1,000 sample queries gives a statistically reliable recall estimate. Do it once a week. When recall drops below your target (say, 85%), rebuild the index.
The problem is that this procedure is expensive (1,000 queries × 2 modes), requires a separate monitoring job, and produces a number that you then have to act on manually.
A Simpler Proxy: Row Change Rate
Exact recall measurement is the ground truth. But there's a cheaper proxy that correlates well with drift: the fraction of rows that have changed since the last index build.
If 30% of your vectors have been inserted, updated, or deleted since the last REINDEX, your index reflects a very different distribution than what's currently in the table. The centroid assignments are stale for 30% of the data. The tombstone fraction is significant.
This isn't a perfect predictor of recall degradation — the actual impact depends on whether the changes are uniformly distributed or concentrated in specific regions of the embedding space. But as a heuristic for "this index needs attention," it's reliable.
pg_trickle tracks this metric as rows_changed_since_last_reindex per stream table. The pgtrickle.vector_status() view (v0.38) exposes it:
SELECT
stream_table,
total_rows,
rows_changed_since_reindex,
ROUND(rows_changed_since_reindex::numeric / total_rows * 100, 1) AS drift_pct,
last_reindex_at,
last_refresh_at
FROM pgtrickle.vector_status();
stream_table | total_rows | rows_changed | drift_pct | last_reindex_at
-----------------+------------+--------------+-----------+----------------
product_corpus | 2,400,000 | 312,000 | 13.0 | 2026-04-01
user_taste | 850,000 | 51,000 | 6.0 | 2026-04-20
doc_embeddings | 125,000 | 42,000 | 33.6 | 2026-03-15
doc_embeddings at 33.6% drift is the one to worry about. It was last reindexed 6 weeks ago and has had significant churn.
Drift-Aware Automatic Reindexing
Manual monitoring and scheduled rebuilds are operational debt. pg_trickle v0.38 introduces a policy-based approach:
SELECT pgtrickle.alter_stream_table(
'doc_embeddings',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.15 -- trigger at 15% drift
);
After each refresh cycle, the scheduler checks whether rows_changed_since_reindex / total_rows > 0.15. If so, it enqueues a REINDEX CONCURRENTLY for the vector column's index in a background worker tier with lower priority than the refresh worker.
REINDEX CONCURRENTLY doesn't lock the table for reads. Queries continue executing against the old index while the new index is built. When the build completes, PostgreSQL atomically swaps the indexes. The downtime window is zero.
The sequence:
- Drift exceeds 15%.
- pg_trickle enqueues REINDEX CONCURRENTLY in a low-priority tier.
- The reindex starts. Queries use the old index.
- The reindex completes (time proportional to table size and available parallelism).
- Indexes are swapped atomically. rows_changed_since_reindex resets to 0.
No oncall page. No manual REINDEX command. No recall degradation that goes undetected for months.
Choosing the Threshold
15% is a reasonable starting point, but the right threshold depends on your data and quality requirements.
High-recall requirements (>90%): Use a lower threshold — 5–10%. Accept more frequent reindexes in exchange for tighter recall guarantees.
Stable data distributions: If your vectors come from a domain that doesn't evolve rapidly (e.g., a product catalog in a stable category), you can use a higher threshold — 20–25%.
Volatile distributions: If you're embedding news articles, tweets, or rapidly evolving content, drift happens fast. Use a lower threshold and a more frequent monitoring cadence.
For HNSW specifically, the threshold should also account for tombstone fraction:
SELECT pgtrickle.alter_stream_table(
'user_taste',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.20, -- 20% general drift
reindex_tombstone_ratio => 0.10 -- or 10% tombstones
);
Either condition triggers a reindex. The tombstone ratio catches the HNSW deletion problem independently of the distribution drift.
IVFFlat: Also Rebalance Your Lists
One more thing about IVFFlat that's worth knowing.
When you run REINDEX, the new IVFFlat index re-runs k-means clustering from scratch. It's trained on the current data distribution. The cluster assignments are fresh.
How many lists should you use? The pgvector README recommends:
- Up to 1M rows: lists = rows / 1000
- Over 1M rows: lists = sqrt(rows)
But these are starting points. If your data has changed significantly in structure — you've moved from a single domain to multiple domains, for example — the optimal lists count may have changed too.
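If you want a starting value computed from the planner's own row estimate, a quick sketch against pg_class.reltuples (using the product_corpus table from the earlier example; round and sanity-check the result):
-- pgvector README heuristic: rows/1000 below 1M rows, sqrt(rows) above
SELECT CASE
         WHEN reltuples < 1000000 THEN GREATEST(1, round(reltuples / 1000))::int
         ELSE ceil(sqrt(reltuples))::int
       END AS suggested_lists
FROM pg_class
WHERE relname = 'product_corpus';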
pg_trickle doesn't (yet) auto-tune lists. You'll need to specify it in your index definition:
CREATE INDEX CONCURRENTLY product_corpus_ivf_idx
ON product_corpus USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 500);
After a reindex triggered by drift, check recall measurement again. If recall improved less than expected, try a different lists value.
The Combined Problem
In practice, both problems — IVFFlat distribution drift and HNSW tombstone accumulation — occur together. A production embedding table:
- Grows as new content is added (HNSW: new insertions, IVFFlat: new insertions to stale centroids)
- Is updated as content changes (HNSW: delete old node, insert new node = tombstone + new node)
- Has documents deleted as content is removed (HNSW: tombstones, IVFFlat: removed from centroid without rebalancing)
Without automated reindexing, this produces a slow, steady recall degradation that's invisible until a user reports it. With drift-aware reindexing, you set the policy once and let pg_trickle maintain the health of your vector indexes alongside the freshness of your data.
What This Doesn't Solve
Drift-aware reindexing helps. It doesn't solve everything.
Very large tables: A REINDEX CONCURRENTLY on a 100M-row vector table takes hours. Even at a 15% drift threshold, you might be doing this every few weeks. At some scale, you need a sharding strategy (partitioned tables with per-partition indexes, or a dedicated vector database) rather than monolithic reindexing.
Embedding-model changes: If you change your embedding model (e.g., from OpenAI text-embedding-3-small to text-embedding-3-large), all your vectors need to be recomputed. This isn't a drift problem — it's a full migration. pg_trickle's reindexing helps after the re-embedding is done, but it doesn't drive the re-embedding itself.
Recall monitoring: Drift percentage is a proxy. It doesn't tell you whether recall has actually degraded. If you need hard recall SLAs, you need recall measurement in addition to drift monitoring. pg_trickle's vector_status() view is meant to sit alongside, not replace, periodic recall sampling.
The message is: distribution drift is a real production problem, most teams discover it too late, and the tooling to manage it automatically now exists. Use it.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
The CDC Mode You Never Have to Choose
How pg_trickle's hybrid change-data-capture starts with triggers and graduates to WAL
Every IVM system needs to know what changed. That's the CDC (change data capture) problem: given a source table, produce a stream of inserts, updates, and deletes.
PostgreSQL gives you two mechanisms: row-level triggers and logical replication (WAL decoding). Triggers are always available but add overhead to every DML statement. WAL-based CDC has near-zero write-side overhead but requires wal_level = logical, a replication slot, and a decoder plugin.
Most systems make you choose. pg_trickle doesn't. Its default CDC mode — AUTO — starts with triggers because they always work, then silently transitions to WAL-based capture when the prerequisites are met. If the WAL decoder fails, it falls back to triggers without losing a single change.
This post explains the three CDC modes, the transition orchestration, and why you should almost certainly leave the default alone.
The Three Modes
Trigger Mode (cdc_mode => 'trigger')
pg_trickle installs row-level AFTER INSERT OR UPDATE OR DELETE triggers on each source table. When a row changes, the trigger writes a copy of the changed row (new values for INSERT/UPDATE, old values for DELETE) into a change buffer table in the pgtrickle_changes schema.
orders (source) → AFTER trigger → pgtrickle_changes.changes_<oid>
Pros:
- Works on any PostgreSQL installation, no configuration changes
- Works with foreign tables
- Works on replicas (if they're writable, e.g., Citus workers)
- Single-transaction atomicity — the change buffer write is in the same transaction as the source DML
Cons:
- Every INSERT/UPDATE/DELETE on the source table does extra work (the trigger fires, the buffer row is written)
- For high-throughput tables (>10,000 rows/second), the trigger overhead is measurable: roughly 10–15% additional CPU time per DML statement
WAL Mode (cdc_mode => 'wal')
pg_trickle creates a logical replication slot and a background WAL decoder worker that reads the write-ahead log and decodes changes into the same buffer tables.
orders (source) → WAL → pg_trickle WAL decoder → pgtrickle_changes.changes_<oid>
Pros:
- Near-zero write-side overhead — no trigger fires during DML
- Better throughput under high write load
- The WAL decoder is a single reader, not per-statement
Cons:
- Requires wal_level = logical (which means more WAL volume)
- Requires a replication slot (which holds WAL segments until consumed)
- If the decoder falls behind, WAL accumulates on disk
- Doesn't work with foreign tables
Auto Mode (cdc_mode => 'auto') — The Default
Auto mode starts with triggers. In the background, it checks whether WAL-based CDC is available. If it is, it transitions. If the transition fails or the WAL decoder crashes, it falls back to triggers.
This is the default, and for good reason: it means you don't have to think about CDC mode during initial setup. Install pg_trickle, create stream tables, and things work immediately with triggers. Later, when you tune wal_level = logical for other reasons (or because you want lower write overhead), pg_trickle picks it up automatically.
The Transition: How It Actually Works
The trigger-to-WAL transition is the most delicate part of the CDC subsystem. The goal: switch from trigger-based capture to WAL-based capture without missing any changes and without double-counting.
It happens in three steps:
Step 1: Slot Creation
pg_trickle creates a logical replication slot. The slot starts capturing WAL from the current LSN (log sequence number). At this point, both triggers and the slot are active — triggers handle current DML, and the slot starts accumulating WAL for future use.
Step 2: Decoder Catch-Up
The WAL decoder worker starts reading from the slot. It needs to reach the current LSN — the point where it's caught up with the live write stream. pg_trickle waits until the decoder's consumed LSN is within a configurable threshold of the current WAL position.
This step has a timeout: pg_trickle.wal_transition_timeout (default 300 seconds). If the decoder can't catch up in 5 minutes — maybe because write throughput is extremely high — the transition is aborted, and triggers stay active.
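You can watch the catch-up from the PostgreSQL side with the standard replication-slot catalog (the slot-name pattern matches the pgt_slot_* names shown in the status view below):
-- How far each pg_trickle slot is behind the current WAL position
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name LIKE 'pgt_slot_%';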
Step 3: Trigger Drop
Once the decoder is caught up, pg_trickle atomically:
- Marks the source table's CDC mode as wal in the catalog.
- Records the frontier LSN — the exact point where WAL takes over.
- Drops the row-level trigger.
From this point forward, the WAL decoder handles all change capture. The change buffer tables are the same — only the writer changes.
The Fallback: WAL → Triggers
If the WAL decoder crashes, falls too far behind, or the replication slot is dropped, pg_trickle detects the failure and falls back:
- Re-installs the row-level trigger on the source table.
- Marks the CDC mode as trigger in the catalog.
- Logs a warning:
WARNING: pg_trickle WAL decoder for 'orders' failed (slot_lag_exceeded),
falling back to trigger-based CDC
The fallback is safe because the frontier tracking is LSN-based. pg_trickle knows exactly which changes were captured by the WAL decoder and which weren't. The trigger picks up from the current transaction, and the next refresh processes the union of WAL-captured and trigger-captured changes.
No changes are lost. No changes are double-counted.
Monitoring the CDC State
You can see the current CDC mode for every source table:
SELECT * FROM pgtrickle.pgt_cdc_status;
source_table | cdc_mode | slot_name | slot_lag_bytes | trigger_active
---------------+----------+--------------------+----------------+----------------
orders | wal | pgt_slot_orders | 4096 | f
customers | trigger | NULL | NULL | t
products | wal | pgt_slot_products | 12288 | f
inventory | auto | NULL | NULL | t
Key columns:
- cdc_mode: The effective mode right now.
- slot_lag_bytes: How far behind the WAL decoder is. Non-zero is normal; growing continuously is a problem.
- trigger_active: Whether the row-level trigger is installed. In WAL mode, this is false.
When to Override the Default
Almost never. But there are cases:
Force trigger mode when:
- You're using foreign tables as stream table sources (WAL doesn't capture foreign table changes)
- You need the single-transaction atomicity guarantee for IMMEDIATE mode (triggers fire in the same transaction; WAL decoding is async)
- You're on a managed PostgreSQL service that doesn't allow wal_level = logical
SELECT pgtrickle.create_stream_table(
name => 'inventory_levels',
query => $$ ... $$,
schedule => '5s',
cdc_mode => 'trigger'
);
Force WAL mode when:
- Write throughput is very high (>50,000 rows/second) and trigger overhead is measurable
- You want to minimize the CPU impact on the write path
SELECT pgtrickle.create_stream_table(
name => 'event_aggregates',
query => $$ ... $$,
schedule => '2s',
cdc_mode => 'wal'
);
If you force WAL mode and the prerequisites aren't met (wal_level != logical), pg_trickle will error at creation time:
ERROR: WAL-based CDC requires wal_level = logical
HINT: Set wal_level = logical in postgresql.conf and restart,
or use cdc_mode => 'auto' to start with triggers
The WAL Backpressure Safety Net
Since v0.36.0, pg_trickle enforces WAL backpressure when pg_trickle.enforce_backpressure = on. If the replication slot lag exceeds a critical threshold, CDC is paused to prevent unbounded WAL accumulation.
The sequence:
- Slot lag exceeds wal_backpressure_critical_bytes (default 1GB).
- pg_trickle pauses the WAL decoder and emits a WARNING.
- When lag drops below wal_backpressure_resume_bytes (default 512MB), decoding resumes.
This hysteresis prevents the pathological case where a slow consumer causes WAL to pile up until the disk fills. It's the same pattern as TCP flow control — back off when the receiver can't keep up, resume when it catches up.
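A minimal configuration sketch, assuming the two byte thresholds take the pg_trickle. prefix like the other settings, with the documented defaults written out explicitly:
-- Enable backpressure; thresholds shown are the defaults
ALTER SYSTEM SET pg_trickle.enforce_backpressure = on;
ALTER SYSTEM SET pg_trickle.wal_backpressure_critical_bytes = '1GB';
ALTER SYSTEM SET pg_trickle.wal_backpressure_resume_bytes = '512MB';
SELECT pg_reload_conf();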
Performance: Triggers vs. WAL
Benchmarks on a 4-core PostgreSQL 18 instance, 10,000 rows/second sustained write load:
| Metric | Trigger CDC | WAL CDC |
|---|---|---|
| Write latency (p50) | 0.42ms | 0.38ms |
| Write latency (p99) | 1.8ms | 1.1ms |
| CPU overhead per DML | ~12% | ~1% |
| Change capture latency | 0ms (synchronous) | 2–15ms (async) |
| WAL volume increase | None | ~20% (logical decoding) |
The trade-off is clear: triggers cost more per write but have zero capture latency. WAL costs less per write but introduces a small delay between the DML commit and the change appearing in the buffer.
For IMMEDIATE mode (synchronous IVM), triggers are mandatory — the stream table must be updated in the same transaction. For DIFFERENTIAL mode with a schedule of 1 second or more, the 2–15ms capture latency is invisible.
Summary
pg_trickle's CDC subsystem has three modes, but you only need to know one: AUTO. It starts with triggers because they always work. When WAL-based capture becomes available, it transitions automatically. If WAL fails, it falls back without losing changes.
The transition is orchestrated in three steps: slot creation, decoder catch-up, trigger drop. The fallback is safe because frontier tracking is LSN-based.
Override the default only when you have a specific reason: foreign tables, IMMEDIATE mode, or measured trigger overhead on a high-write table. For everything else, let pg_trickle figure it out.
← Back to Blog Index | Documentation
IMMEDIATE Mode: When "Good Enough Freshness" Isn't Good Enough
Synchronous IVM inside your transaction
Most of the conversation about incremental view maintenance focuses on latency: how fast can you refresh? A second? 500 milliseconds? 100?
But there's a class of problems where any refresh lag is too much. If you're computing an account balance, a running inventory count, or a double-entry bookkeeping ledger, reading the stream table and getting a result that's 200ms behind the write that just happened in the same request — that's a bug.
This is what refresh_mode => 'IMMEDIATE' does: it applies the delta inside the same transaction that caused the change. No background worker. No schedule. No lag. When INSERT INTO orders (...) commits, the stream table already reflects that order.
How DIFFERENTIAL and IMMEDIATE Differ
DIFFERENTIAL mode (the default) works like this:
Transaction 1: INSERT INTO orders (...) → trigger fires → change buffer row written
Transaction 1: COMMIT
... 1–5 seconds later ...
Background worker: drain change buffer → compute delta → MERGE into stream table
There's a window between commit and refresh. Your application inserts an order and then immediately queries the stream table — the order isn't there yet. For dashboards that refresh every few seconds, this is fine. For a checkout flow that needs to show the updated total on the next page load, it's not.
IMMEDIATE mode eliminates the window:
Transaction 1: INSERT INTO orders (...) → trigger fires → delta computed inline → MERGE applied
Transaction 1: COMMIT
The delta computation runs as part of the trigger execution. By the time the transaction commits, the stream table is updated. There's no window. The next SELECT from the stream table — even in the same transaction — sees the new data.
The Trade-Off
IMMEDIATE mode isn't free. The delta computation happens on the write path. Every INSERT, UPDATE, or DELETE on a source table does more work — the trigger has to compute the delta and apply it before control returns to your application.
For a simple aggregation over one table, this overhead is small — a few hundred microseconds per row. For a multi-table JOIN with several aggregation groups, it can add single-digit milliseconds per write.
The decision matrix:
| Situation | Mode |
|---|---|
| Dashboard queries, analytics, reporting | DIFFERENTIAL (1–5s schedule) |
| Read-your-writes required, low write throughput | IMMEDIATE |
| High write throughput, some lag acceptable | DIFFERENTIAL |
| Financial calculations, inventory, bookkeeping | IMMEDIATE |
| Write-heavy event logging | DIFFERENTIAL |
If you're unsure, start with DIFFERENTIAL. Switch to IMMEDIATE when you hit a case where read-your-writes consistency matters.
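The switch is the same pgtrickle.alter_stream_table call shown later in this post for the reverse direction. For example:
-- Upgrade an existing stream table to read-your-writes consistency
SELECT pgtrickle.alter_stream_table(
  'customer_totals',
  refresh_mode => 'IMMEDIATE'
);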
A Concrete Example: Account Balances
Double-entry bookkeeping is the textbook use case. Every financial transaction creates two journal entries — a debit and a credit. The account balance is the sum of all entries for that account.
-- Journal entries table
CREATE TABLE journal_entries (
id bigserial PRIMARY KEY,
account_id bigint NOT NULL REFERENCES accounts(id),
entry_type text NOT NULL CHECK (entry_type IN ('debit', 'credit')),
amount numeric(15,2) NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
-- Account balance: always consistent, always current
SELECT pgtrickle.create_stream_table(
'account_balances',
$$SELECT
account_id,
SUM(CASE WHEN entry_type = 'credit' THEN amount ELSE -amount END) AS balance,
COUNT(*) AS entry_count,
MAX(created_at) AS last_entry_at
FROM journal_entries
GROUP BY account_id$$,
refresh_mode => 'IMMEDIATE'
);
Now when your application writes a journal entry and reads the balance in the same transaction:
BEGIN;
INSERT INTO journal_entries (account_id, entry_type, amount)
VALUES (42, 'credit', 150.00);
-- This returns the updated balance, including the $150 credit
SELECT balance FROM account_balances WHERE account_id = 42;
COMMIT;
No race condition. No eventual consistency. No background worker to wait for.
Inventory Tracking
Same pattern, different domain. An e-commerce warehouse tracks stock with an events table:
CREATE TABLE stock_events (
id bigserial PRIMARY KEY,
sku text NOT NULL,
warehouse text NOT NULL,
quantity int NOT NULL, -- positive = received, negative = shipped
event_type text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
SELECT pgtrickle.create_stream_table(
'inventory_levels',
$$SELECT
sku,
warehouse,
SUM(quantity) AS on_hand,
SUM(CASE WHEN quantity < 0 THEN -quantity ELSE 0 END) AS total_shipped,
SUM(CASE WHEN quantity > 0 THEN quantity ELSE 0 END) AS total_received
FROM stock_events
GROUP BY sku, warehouse$$,
refresh_mode => 'IMMEDIATE'
);
When the checkout service writes a shipment event, inventory_levels.on_hand is decremented in the same transaction. The next availability check — even in the same request — sees the correct count. No overselling because a background worker hadn't caught up.
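The availability check itself is just a read of the stream table (the SKU and warehouse values are placeholders):
-- Runs in the same transaction as the shipment event and already sees its effect
SELECT on_hand
FROM inventory_levels
WHERE sku = 'SKU-1042' AND warehouse = 'east-1';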
What IMMEDIATE Mode Restricts
Not every query supports IMMEDIATE mode. The restriction exists because the trigger must compute the delta synchronously, and some operators require access to data that isn't available in the trigger context.
Queries that work in IMMEDIATE mode:
- Single-table or multi-table JOINs with aggregates (SUM, COUNT, AVG, MIN, MAX)
- WHERE filters on source columns
- CASE expressions in aggregates
- GROUP BY on any column or expression
Queries that require DIFFERENTIAL or FULL mode:
- Window functions (RANK, ROW_NUMBER, LAG, LEAD) — these need the full partition
- HAVING clauses that reference aggregate results from other groups
- Queries referencing now() or other volatile functions in the defining query
- DISTINCT without GROUP BY
- LIMIT / OFFSET
pg_trickle tells you if your query isn't compatible:
SELECT pgtrickle.create_stream_table(
'bad_example',
$$SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees$$,
refresh_mode => 'IMMEDIATE'
);
-- ERROR: query uses window functions, which are not supported in IMMEDIATE mode.
-- HINT: Use refresh_mode => 'DIFFERENTIAL' or 'FULL' instead.
Mixing Modes in a DAG
A common pattern is to use IMMEDIATE for the leaf stream tables that face the application, and DIFFERENTIAL for upstream aggregation layers:
-- Silver layer: cleaned, enriched orders (DIFFERENTIAL, 2s schedule)
SELECT pgtrickle.create_stream_table(
'orders_enriched',
$$SELECT o.id, o.amount, o.created_at,
c.name, c.region, c.tier
FROM orders o
JOIN customers c ON c.id = o.customer_id$$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL'
);
-- Gold layer: per-customer balance (IMMEDIATE, no schedule needed)
SELECT pgtrickle.create_stream_table(
'customer_totals',
$$SELECT customer_id, SUM(amount) AS total_spend, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id$$,
refresh_mode => 'IMMEDIATE'
);
The dashboard reads from orders_enriched (2-second lag is fine). The checkout flow reads from customer_totals (zero lag, because the application just wrote the order).
When to Move From IMMEDIATE to DIFFERENTIAL
If your write throughput grows and the per-write overhead of IMMEDIATE mode starts showing up in your P99 latency, switch:
SELECT pgtrickle.alter_stream_table(
'account_balances',
refresh_mode => 'DIFFERENTIAL',
schedule => '1 second'
);
One-second staleness is usually acceptable for everything except the strictest consistency requirements. And for those, you can keep the critical tables on IMMEDIATE while moving the less critical ones to DIFFERENTIAL.
The ALTER is online. No downtime. The stream table continues serving reads during the switch.
← Back to Blog Index | Documentation
The Inbox Pattern: Receiving Events from Kafka into PostgreSQL
Idempotent, ordered, exactly-once event ingestion without writing a consumer
The outbox pattern gets all the attention. You write an event to an outbox table in the same transaction as your business data, and an external process delivers it to Kafka/NATS/SQS. Problem solved: no dual-write, no inconsistency.
But what about the other direction? Your service needs to receive events from Kafka and process them. You need:
- Events to land in PostgreSQL reliably.
- Duplicate events to be handled idempotently.
- Events to be processed in order.
- Failed processing to not lose the event.
Most teams build this with a Kafka consumer in their application code: poll, deserialize, write to the database, commit the offset. It works, but it's another piece of infrastructure to maintain, monitor, and debug.
pg_trickle's inbox pattern moves the consumer into PostgreSQL.
How the Inbox Works
1. Create an inbox
SELECT pgtrickle.create_inbox('payment_events');
This creates:
- pgtrickle.inbox_payment_events — the inbox table where events land
- Deduplication infrastructure (unique constraint on event ID)
- An ordering guarantee (sequence number per partition)
2. Configure the relay to deliver events
# relay.toml
[[pipeline]]
name = "payments-inbound"
source = { type = "kafka", brokers = "kafka:9092", topic = "payment.completed", group_id = "pgtrickle-payments" }
sink = { type = "inbox", inbox_name = "payment_events" }
The relay reads from Kafka and writes to the inbox table. Each Kafka message becomes a row in pgtrickle.inbox_payment_events.
3. Build stream tables on top of the inbox
-- Aggregate payment events into per-customer totals
SELECT pgtrickle.create_stream_table(
'customer_payment_totals',
$$SELECT
(payload->>'customer_id')::bigint AS customer_id,
SUM((payload->>'amount')::numeric) AS total_paid,
COUNT(*) AS payment_count,
MAX((payload->>'completed_at')::timestamptz) AS last_payment
FROM pgtrickle.inbox_payment_events
GROUP BY (payload->>'customer_id')::bigint$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
Events from Kafka are now a PostgreSQL table, with incremental aggregation on top.
Deduplication
Kafka guarantees at-least-once delivery. The same event can be delivered multiple times — network retries, consumer rebalances, relay restarts.
The inbox handles deduplication using the event's unique identifier:
-- Inbox table structure (simplified)
CREATE TABLE pgtrickle.inbox_payment_events (
inbox_seq bigserial PRIMARY KEY,
event_id text NOT NULL UNIQUE, -- Kafka message key or a field from the payload
partition_key text,
payload jsonb NOT NULL,
received_at timestamptz NOT NULL DEFAULT now(),
processed boolean NOT NULL DEFAULT false
);
The UNIQUE constraint on event_id means that if the relay delivers the same event twice, the second INSERT is silently dropped (using ON CONFLICT DO NOTHING). No duplicates in the inbox, no duplicates in the stream table.
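Roughly, the relay's write behaves like the following (a sketch of the semantics, not the relay's literal statement; the values are illustrative):
-- A second delivery of the same event_id is a no-op
INSERT INTO pgtrickle.inbox_payment_events (event_id, partition_key, payload)
VALUES ('pay_123', '42', '{"customer_id": 42, "amount": 150.00}'::jsonb)
ON CONFLICT (event_id) DO NOTHING;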
The event ID comes from the Kafka message key by default. You can configure the relay to extract it from the payload:
[[pipeline]]
name = "payments-inbound"
source = { type = "kafka", brokers = "kafka:9092", topic = "payment.completed", group_id = "pgtrickle-payments" }
sink = { type = "inbox", inbox_name = "payment_events", dedup_key = "$.payment_id" }
Ordering
Events from the same Kafka partition arrive in order. The inbox preserves this ordering with a partition_key column and a sequence number.
If your application needs to process events in order per customer:
SELECT pgtrickle.enable_inbox_ordering('payment_events', partition_key => 'customer_id');
This ensures that the relay inserts events for the same customer sequentially, and the stream table's delta computation respects the ordering within each partition.
For most use cases — aggregations, counts, sums — ordering doesn't matter. The aggregate is the same regardless of insertion order. But for stateful processing (where event N depends on event N-1), ordering matters and the inbox preserves it.
Dead-Letter Queue
If the relay can't deserialize a Kafka message (malformed JSON, unexpected schema), it routes the message to a dead-letter table:
-- Automatically created alongside the inbox
-- pgtrickle.inbox_payment_events_dlq
SELECT * FROM pgtrickle.inbox_payment_events_dlq;
| inbox_dlq_seq | event_id | raw_payload | error_message | received_at |
|---|---|---|---|---|
| 1 | pay_99 | {invalid json | "unexpected end of JSON input" | 2026-04-27 10:00:01 |
Dead-letter events don't affect the main inbox or the stream tables. You can inspect them, fix the upstream producer, and replay them manually if needed.
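Replay is ordinary SQL. Once you've corrected the payload by hand, something like this works (a sketch; the corrected payload is illustrative, and the dedup constraint still protects against double-processing):
-- Re-insert the fixed event into the inbox, then clear it from the DLQ
INSERT INTO pgtrickle.inbox_payment_events (event_id, payload)
VALUES ('pay_99', '{"payment_id": "pay_99", "customer_id": 7, "amount": 12.50}'::jsonb)
ON CONFLICT (event_id) DO NOTHING;
DELETE FROM pgtrickle.inbox_payment_events_dlq WHERE inbox_dlq_seq = 1;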
A Complete Example: Order Fulfillment
Your order service publishes events to Kafka when orders are placed. Your inventory service (running PostgreSQL + pg_trickle) needs to track orders per SKU:
# Inventory service relay config
[[pipeline]]
name = "orders-inbound"
source = { type = "kafka", brokers = "kafka:9092", topic = "orders.placed", group_id = "inventory-service" }
sink = { type = "inbox", inbox_name = "incoming_orders", dedup_key = "$.order_id" }
-- Create the inbox
SELECT pgtrickle.create_inbox('incoming_orders');
-- Stream table: pending fulfillment per SKU
SELECT pgtrickle.create_stream_table(
'pending_fulfillment',
$$SELECT
item->>'sku' AS sku,
SUM((item->>'quantity')::int) AS total_quantity,
COUNT(DISTINCT (payload->>'order_id')) AS order_count
FROM pgtrickle.inbox_incoming_orders,
jsonb_array_elements(payload->'items') AS item
WHERE NOT processed
GROUP BY item->>'sku'$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
When the order service publishes an event, it flows through Kafka → relay → inbox → stream table. The pending_fulfillment table shows how many units of each SKU need to be shipped.
When the warehouse marks an order as shipped, the application updates the inbox row (processed = true), and the stream table's delta removes that order's contribution from the aggregate.
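Marking it shipped is a plain UPDATE on the inbox row (assuming the incoming_orders inbox has the same processed flag as the simplified structure shown earlier; the event ID is a placeholder):
-- Shipment confirmed: the next refresh drops this order from pending_fulfillment
UPDATE pgtrickle.inbox_incoming_orders
SET processed = true
WHERE event_id = 'order-2026-0042';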
Inbox vs. Direct Kafka Consumer
| Aspect | Direct Kafka consumer | pg_trickle inbox |
|---|---|---|
| Code to write | Consumer class, deserialization, DB writes, offset management | TOML config + SQL |
| Deduplication | Application-level (you implement it) | Built-in (UNIQUE constraint) |
| Dead-letter queue | Application-level | Built-in |
| Aggregation | Application-level queries | Stream tables (incremental) |
| Monitoring | Custom metrics | pg_trickle monitoring (built-in) |
| Ordering | Kafka partition ordering + application logic | Preserved in inbox |
| Scaling | Consumer group + application instances | Relay instances + advisory lock HA |
The inbox pattern is particularly useful when:
- The events end up in PostgreSQL anyway (you just need them in a table)
- You want to aggregate the events incrementally
- You don't want to write and maintain consumer code
- You need exactly-once semantics in the database
If your event processing requires complex business logic that can't be expressed in SQL (calling external APIs, sending emails, orchestrating workflows), a direct consumer is more appropriate. The inbox is for data ingestion and aggregation, not arbitrary computation.
← Back to Blog Index | Documentation
Incremental Aggregates in PostgreSQL: No ETL Required
Running SUM, COUNT, AVG, and vector_avg over live tables without batch jobs
Every analytics system eventually needs pre-aggregated data. Raw tables get too big. Query latency becomes unacceptable. You start maintaining a separate summary table.
Then you need to keep it fresh. That's where things go wrong.
The standard playbook is to build an ETL pipeline: detect changes, compute new aggregates, write them back. This works. It also means you now own an ETL pipeline — a background process, a job scheduler, failure handling, backlog monitoring, and a latency SLA that's measured in minutes rather than seconds.
The question this post answers: do you actually need the ETL layer? For most SQL-expressible aggregates, the answer is no. pg_trickle can maintain running aggregates directly in PostgreSQL, incrementally, using the algebra of the aggregate functions themselves.
Why Batch Aggregation Is the Default
Aggregate queries are inherently read-heavy. Computing SUM(revenue) over 100 million orders requires reading 100 million rows. Computing it over 1 billion rows requires reading 1 billion rows. The compute cost scales linearly with the data.
When you need that aggregate continuously — refreshed every minute, every 10 seconds — you hit a ceiling. At some table size, the scan takes longer than the interval. You can't run a 30-second aggregate every 10 seconds.
The usual response is to compute the aggregate less often (accept stale data), or to push the computation out of PostgreSQL into a streaming system better suited to continuous processing (accept operational complexity).
Both choices feel like giving up something real. Incremental aggregation avoids that tradeoff.
The Algebraic Trick
The reason incremental aggregation is possible is that most aggregates have a well-defined delta function.
For SUM(x):
- Insert a row with value v: delta = +v
- Delete a row with value v: delta = -v
- Update a row from old_v to new_v: delta = +(new_v - old_v)
For COUNT(*):
- Insert: delta = +1
- Delete: delta = -1
- Update: delta = 0 (row count unchanged)
For AVG(x):
- Maintain running_sum and running_count separately
- AVG = running_sum / running_count
- Update running_sum and running_count with the delta, then recompute AVG
This is O(1) per change, regardless of table size. The key insight is that SUM, COUNT, and AVG are all expressible in terms of a running state that can be updated with each delta. Algebraically, they form a commutative group: every operation has an inverse (subtraction for addition, decrement for increment) that lets you both add and remove elements from the aggregate.
This is the mathematics underlying pg_trickle's DVM (differential view maintenance) engine. The engine carries a ruleset of delta functions for each supported aggregate and applies them when changes arrive.
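As a sketch of what applying one of these delta rules amounts to, here is AVG when a single value of 150.00 lands in an existing group (the agg_state table and its columns are illustrative, not pg_trickle's internal storage):
-- O(1) per change: bump the running state, derive the new average
-- (column references on the right-hand side read the pre-update values)
UPDATE agg_state
SET running_sum   = running_sum + 150.00,
    running_count = running_count + 1,
    avg_value     = (running_sum + 150.00) / (running_count + 1)
WHERE group_key = 'europe';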
Simple Aggregates: SUM, COUNT, AVG
-- Maintain real-time revenue metrics by region and day
SELECT pgtrickle.create_stream_table(
name => 'daily_revenue',
query => $$
SELECT
c.region,
date_trunc('day', o.created_at) AS day,
SUM(o.total) AS revenue,
COUNT(*) AS order_count,
AVG(o.total) AS avg_order_value,
COUNT(DISTINCT o.customer_id) AS unique_customers
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region, date_trunc('day', o.created_at)
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
When an order is inserted:
- pg_trickle identifies the affected group (region='europe', day='2026-04-27').
- The DVM engine computes delta_revenue = +order.total, delta_count = +1, new_avg = (old_sum + order.total) / (old_count + 1).
- A single UPDATE is applied to daily_revenue for that one row.
An order placed at 14:32 is reflected in daily_revenue within 5 seconds. No batch job. No ETL. No Kafka.
Note the COUNT(DISTINCT customer_id) — this is a case where the delta is more complex. DISTINCT counts require tracking membership, not just a running tally. pg_trickle handles this with a set-based approach, but for very high-cardinality distinct counts, HyperLogLog approximation is often more practical and is on the roadmap.
Conditional Aggregates
Real summary tables always have conditional logic — active vs. inactive, successful vs. failed, above/below threshold.
SELECT pgtrickle.create_stream_table(
name => 'support_dashboard',
query => $$
SELECT
team_id,
COUNT(*) AS total_tickets,
COUNT(*) FILTER (WHERE status = 'open') AS open_tickets,
COUNT(*) FILTER (WHERE status = 'urgent') AS urgent_tickets,
COUNT(*) FILTER (WHERE status = 'resolved'
AND resolved_at > NOW() - INTERVAL '24 hours')
AS resolved_today,
AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)))
FILTER (WHERE resolved_at IS NOT NULL) AS avg_resolution_secs,
MAX(created_at) AS latest_ticket_at
FROM tickets
GROUP BY team_id
$$,
schedule => '3 seconds',
refresh_mode => 'DIFFERENTIAL'
);
The FILTER (WHERE ...) construct is a standard SQL conditional aggregate. pg_trickle handles it by carrying the condition into the delta computation. A ticket transitioning from open to resolved generates:
- delta_open_tickets = -1
- delta_resolved_today = +1 (if within the 24-hour window)
- A recomputation of avg_resolution_secs
The time-window filter (resolved_at > NOW() - INTERVAL '24 hours') introduces a subtlety: rows can age out of the window without any DML. This is handled by pg_trickle's time-window eviction logic, which runs during each refresh cycle to evict rows that have moved outside their window bounds.
Multi-Table Aggregates With Joins
The place where batch ETL is hardest to replace is multi-table aggregation — computing aggregates over data that spans multiple source tables.
-- Per-author engagement metrics: articles + reactions + comments
SELECT pgtrickle.create_stream_table(
name => 'author_engagement',
query => $$
SELECT
u.id AS author_id,
u.display_name,
COUNT(DISTINCT a.id) AS article_count,
SUM(r.count) AS total_reactions,
COUNT(DISTINCT c.id) AS total_comments,
ROUND(SUM(r.count)::numeric /
NULLIF(COUNT(DISTINCT a.id), 0), 2) AS avg_reactions_per_article
FROM users u
LEFT JOIN articles a ON a.author_id = u.id
AND a.published = true
LEFT JOIN article_reactions r ON r.article_id = a.id
LEFT JOIN comments c ON c.article_id = a.id
GROUP BY u.id, u.display_name
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
When a reaction is added to an article by author 42:
- The CDC trigger on article_reactions fires, recording the INSERT.
- The DVM engine evaluates the join delta: this reaction belongs to an article by author 42.
- The engine updates the author_id = 42 row in author_engagement: total_reactions += 1, recompute avg_reactions_per_article.
No other rows are touched. Author 43's metrics are unaffected.
This is the join-propagation problem that makes hand-rolled incremental ETL so fragile: the change in article_reactions affects a row in the aggregate keyed by author_id, through an intermediate join on articles. The DVM engine handles this join-delta propagation automatically.
Vector Aggregates: vector_avg
Here's where things get interesting for ML workloads.
The same algebraic principle that applies to AVG(price) applies to AVG(embedding) — a vector average is just an element-wise sum divided by a count. It's the same delta function applied to each dimension.
Arriving in v0.37:
-- Per-user taste vector: average embedding of liked items
SELECT pgtrickle.create_stream_table(
name => 'user_taste',
query => $$
SELECT
ul.user_id,
vector_avg(i.embedding) AS taste_vec,
COUNT(*) AS like_count,
MAX(ul.liked_at) AS last_interaction
FROM user_likes ul
JOIN items i ON i.id = ul.item_id
GROUP BY ul.user_id
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON user_taste USING hnsw (taste_vec vector_cosine_ops);
When user 42 likes item 1701:
- CDC captures the INSERT on user_likes.
- The DVM engine retrieves item 1701's embedding.
- Delta: new_sum_vec = old_sum_vec + item_1701.embedding, new_count = old_count + 1, new_taste_vec = new_sum_vec / new_count.
- One UPDATE to user_taste for user_id = 42.
One million users, thousands of likes per second. Each refresh cycle touches only the users whose likes changed in that cycle. The HNSW index on taste_vec receives targeted updates, not a full rebuild.
This replaces the "recompute all user taste vectors every night" batch job with a continuously maintained table. Users get personalization that reflects their most recent action, not last night's.
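Serving a recommendation is then a single query against the live taste vector (the items.title column is assumed here for illustration):
-- Ten nearest items to user 42's current taste vector
SELECT i.id, i.title
FROM user_taste ut
JOIN LATERAL (
  SELECT id, title
  FROM items
  ORDER BY embedding <=> ut.taste_vec
  LIMIT 10
) i ON true
WHERE ut.user_id = 42;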
Percentile and Rank Aggregates: The Hard Cases
Not every aggregate has a clean delta function. Some require knowing the full distribution.
PERCENTILE_CONT / PERCENTILE_DISC: Median and other percentiles require sorted access to the full set of values. There's no O(1) way to update a percentile when a single value changes. These always require a full recomputation.
RANK() / DENSE_RANK(): Inserting one row can change the rank of every other row. This is an O(n) update in the worst case.
NTILE(): Same problem as rank.
For these, pg_trickle falls back to refresh_mode => 'FULL' — which is still useful if you want scheduled maintenance, monitoring, and the other benefits of stream tables, just without the differential speedup.
SELECT pgtrickle.create_stream_table(
name => 'customer_revenue_ranks',
query => $$
SELECT
customer_id,
total_revenue,
RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank
FROM customer_totals
$$,
schedule => '1 minute',
refresh_mode => 'FULL' -- RANK() requires full recompute
);
This is still better than an unmanaged materialized view: the refresh is scheduled, monitored, and the table's freshness is visible via pgtrickle.stream_table_status().
Monitoring What You've Built
Once stream tables exist, pg_trickle exposes their state through a monitoring view:
SELECT
name,
refresh_mode,
last_refresh_at,
EXTRACT(EPOCH FROM (NOW() - last_refresh_at))::int AS staleness_secs,
rows_changed_last_cycle,
avg_refresh_ms,
schedule
FROM pgtrickle.stream_table_status()
ORDER BY staleness_secs DESC;
This is what replaces the ETL job monitoring dashboard. You're not watching a Celery worker queue or a Kafka consumer lag counter — you're watching a PostgreSQL view that reports directly from the extension's catalog. One SELECT tells you everything.
The Tradeoff
Incremental aggregation inside PostgreSQL is not free. The DVM engine adds overhead to the source write path — CDC triggers add roughly 50–200 microseconds per modified row, depending on how many stream tables reference that table.
For high-write workloads (thousands of writes per second), this overhead is worth measuring. pg_trickle's backpressure mechanism can be configured to shed load by falling back to full refresh when the write rate exceeds a configurable threshold.
But for the vast majority of OLTP workloads — hundreds to low thousands of writes per second — the overhead is negligible, and the elimination of the external ETL layer more than compensates.
The ETL pipeline that takes 40 minutes of engineering per incident and lives in a separate repository, separate monitoring system, and separate deployment pipeline? That's gone. The aggregate data is where it should have been all along: in the database, maintained by the database, queryable with SQL.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Incremental Full-Text Search with tsvector
Maintain ranked search results incrementally as documents change — without re-indexing the corpus or reaching for Elasticsearch
Full-text search in PostgreSQL is remarkably capable. The tsvector type, GIN indexes, and ts_rank function give you tokenization, stemming, positional matching, and relevance ranking — all inside the database. What PostgreSQL doesn't give you is incremental search result maintenance. If you have a materialized view that pre-computes ranked search results for popular queries, that view becomes stale the moment a document is inserted or updated. Refreshing it means re-ranking the entire corpus.
pg_trickle bridges this gap. By maintaining search result views as stream tables, you get search results that update incrementally as documents change. A new blog post appears in search results within seconds of being published. An updated product description immediately affects its ranking. A deleted page disappears from results without waiting for a nightly re-index.
The Search Materialization Problem
Consider a content platform with millions of articles. Users search frequently, and the same queries repeat (long-tail distribution). The application pre-computes search results for the top 1,000 queries:
-- Traditional approach: materialized view of pre-ranked results
CREATE MATERIALIZED VIEW search_cache AS
SELECT
q.query_text,
a.id AS article_id,
a.title,
a.summary,
ts_rank(a.search_vector, plainto_tsquery(q.query_text)) AS relevance
FROM popular_queries q
CROSS JOIN LATERAL (
SELECT *
FROM articles
WHERE search_vector @@ plainto_tsquery(q.query_text)
ORDER BY ts_rank(search_vector, plainto_tsquery(q.query_text)) DESC
LIMIT 50
) a;
This materialized view holds the top 50 results for each popular query. Refreshing it means re-executing 1,000 full-text searches across the entire articles table. For a corpus of 5 million articles, that's an expensive operation — even with GIN indexes, it takes seconds to minutes depending on query complexity.
But when a single article is added or updated, only the queries that match that article need their results updated. If the new article matches 3 of the 1,000 popular queries, only 3 result sets need adjustment. The other 997 are unchanged.
Search Results as a Stream Table
-- Articles with pre-computed tsvector
CREATE TABLE articles (
id serial PRIMARY KEY,
title text NOT NULL,
body text NOT NULL,
search_vector tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', body), 'B')
) STORED,
published_at timestamptz DEFAULT now(),
author_id integer
);
CREATE INDEX ON articles USING gin(search_vector);
-- Popular queries to maintain results for
CREATE TABLE tracked_queries (
id serial PRIMARY KEY,
query_text text NOT NULL UNIQUE,
tsquery tsquery GENERATED ALWAYS AS (plainto_tsquery('english', query_text)) STORED
);
-- Stream table: live search results
SELECT pgtrickle.create_stream_table(
'live_search_results',
$$
SELECT
tq.id AS query_id,
tq.query_text,
a.id AS article_id,
a.title,
ts_rank(a.search_vector, tq.tsquery) AS relevance,
a.published_at
FROM tracked_queries tq
JOIN articles a ON a.search_vector @@ tq.tsquery
$$
);
When a new article is published, pg_trickle processes the insertion incrementally. It evaluates which tracked queries the article matches (using the GIN index), computes the relevance score for each match, and inserts the corresponding result rows. Articles that don't match any tracked query produce no delta at all.
For a corpus of 5 million articles with 1,000 tracked queries, publishing a single article might match 5–10 queries. The incremental cost is 5–10 index probes and rank computations — compared to 1,000 full searches for a complete refresh.
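Conceptually, the delta-side work for one new article is a single probe like this (illustrative; the article id is a placeholder):
-- Which tracked queries does the new article match, and at what relevance?
SELECT tq.id AS query_id, ts_rank(a.search_vector, tq.tsquery) AS relevance
FROM articles a
JOIN tracked_queries tq ON a.search_vector @@ tq.tsquery
WHERE a.id = 123456;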
Handling Document Updates
When an article's body or title changes, its search_vector is recomputed (via the GENERATED column). This change propagates to the stream table:
- The old search_vector matched a set of tracked queries — those result rows are removed (weight -1)
- The new search_vector matches a (possibly different) set of queries — those result rows are inserted (weight +1)
- For queries matched by both old and new vectors, the relevance score might change — this appears as a remove + re-insert
The net effect: search results instantly reflect the updated content. If editing an article's title changes its relevance for "machine learning" from 0.42 to 0.67, the stream table's relevance column for that article-query pair updates accordingly.
Top-K Result Ranking
The stream table above maintains all matching results for each query. In practice, you want the top 50 or top 100. You can layer a top-K selection on top:
SELECT pgtrickle.create_stream_table(
'top_search_results',
$$
SELECT *
FROM (
SELECT
query_id,
query_text,
article_id,
title,
relevance,
published_at,
ROW_NUMBER() OVER (PARTITION BY query_id ORDER BY relevance DESC) AS rank
FROM live_search_results
) ranked
WHERE rank <= 50
$$
);
This cascade maintains the top 50 results per query incrementally. When a new article enters the results with high relevance, it displaces the 50th-ranked article. When an article's relevance drops (due to an edit), it might fall out of the top 50 and be replaced by the next candidate.
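Serving a search request is then a cheap, ordered read of the stream table (the query text is one of the examples used in this post):
-- Top 10 hits for a tracked query, already ranked
SELECT article_id, title, relevance
FROM top_search_results
WHERE query_text = 'machine learning'
ORDER BY rank
LIMIT 10;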
Faceted Search Counts
E-commerce search results typically include facet counts — "Electronics (234), Books (89), Clothing (156)." These counts change as products are added, removed, or recategorized.
SELECT pgtrickle.create_stream_table(
'search_facets',
$$
SELECT
tq.id AS query_id,
tq.query_text,
p.category,
COUNT(*) AS match_count,
AVG(p.price) AS avg_price,
MIN(p.price) AS min_price
FROM tracked_queries tq
JOIN products p ON p.search_vector @@ tq.tsquery
GROUP BY tq.id, tq.query_text, p.category
$$
);
Adding a new product in "Electronics" that matches the query "wireless headphones" increments the Electronics count for that query by one. The facet counts are always accurate — no stale cached counts showing 234 when the actual count is 237.
Multi-Language Search
For applications with multilingual content, you might maintain separate tsvectors per language and combine results:
SELECT pgtrickle.create_stream_table(
'multilingual_search',
$$
SELECT
tq.id AS query_id,
a.id AS article_id,
a.title,
a.language,
CASE a.language
WHEN 'english' THEN ts_rank(a.search_vector_en, plainto_tsquery('english', tq.query_text))
WHEN 'german' THEN ts_rank(a.search_vector_de, plainto_tsquery('german', tq.query_text))
WHEN 'french' THEN ts_rank(a.search_vector_fr, plainto_tsquery('french', tq.query_text))
END AS relevance
FROM tracked_queries tq
JOIN articles a ON
(a.language = 'english' AND a.search_vector_en @@ plainto_tsquery('english', tq.query_text)) OR
(a.language = 'german' AND a.search_vector_de @@ plainto_tsquery('german', tq.query_text)) OR
(a.language = 'french' AND a.search_vector_fr @@ plainto_tsquery('french', tq.query_text))
$$
);
Each language's articles are independently maintained. A new German article only triggers delta processing for the German text path. French and English results are untouched.
When to Use This vs. Elasticsearch
Elasticsearch (or OpenSearch, Typesense, Meilisearch) gives you:
- Sub-millisecond query latency across massive corpora
- Sophisticated relevance tuning (BM25, custom scoring)
- Distributed indexing across many nodes
- Fuzzy matching, synonyms, phonetic analysis
PostgreSQL + pg_trickle gives you:
- Transactional consistency (search results reflect the committed state)
- No synchronization lag (no "indexed after 5 seconds" delay)
- No separate infrastructure to operate
- Joins between search results and relational data
- Incremental maintenance of pre-computed result sets
The sweet spot for pg_trickle search is: you have a moderate corpus (up to ~10 million documents), a known set of popular or tracked queries, and you need results to be immediately consistent with the database state. This covers internal tools, admin panels, product catalogs, content management systems, and documentation sites.
If you need to search 100 million documents with sub-50ms latency and fuzzy matching, Elasticsearch is the right tool. But if you're running Elasticsearch just to avoid stale search results on a 2-million-document corpus, pg_trickle eliminates that infrastructure entirely.
Performance Characteristics
| Scenario | Full refresh (all queries) | Incremental (1 document change) |
|---|---|---|
| 100 tracked queries, 1M docs | 3.2s | 8ms |
| 1,000 tracked queries, 5M docs | 28s | 15ms |
| 10,000 tracked queries, 10M docs | 4.5min | 45ms |
The incremental cost scales with the number of queries the changed document matches — typically 5–20 out of thousands. This makes it practical to maintain live search results for large query sets without the operational cost of a separate search engine.
Your full-text search is already in PostgreSQL. pg_trickle makes the results live. No separate search index, no synchronization delay, no stale results.
← Back to Blog Index | Documentation
Incremental ML Feature Engineering in PostgreSQL
Replace your nightly feature store batch job with continuously fresh features maintained as stream tables
Machine learning models are only as good as their features. And features are only as good as their freshness. A fraud detection model trained on "average transaction amount in the last 7 days" is useless if that feature was computed yesterday and the fraudster has been active for the last 6 hours with abnormally large transactions.
The standard ML feature engineering pipeline looks like this: a scheduled job (Airflow, dbt, cron) runs every hour or every day, reads raw data, computes derived features, and writes them to a feature store. The model reads features from the store at inference time. The lag between raw data and computed features ranges from minutes to hours, depending on your pipeline's schedule and execution time.
pg_trickle collapses this pipeline into a single layer. Features are defined as SQL queries over your operational data, maintained as stream tables that update incrementally as the underlying data changes. The feature store is just a set of materialized views that are always fresh. No pipeline orchestration, no batch jobs, no staleness window.
Features as Stream Tables
Consider a fraud detection model that uses these features per customer:
- Average transaction amount (last 7 days)
- Transaction count (last 24 hours)
- Number of distinct merchants (last 7 days)
- Maximum single transaction (last 30 days)
- Standard deviation of transaction amounts (last 7 days)
- Ratio of current transaction to historical average
Each of these is a SQL aggregate over the transactions table with a time window. Traditionally, you'd compute them in a batch job:
-- Traditional batch feature computation (runs hourly via Airflow)
INSERT INTO customer_features
SELECT
customer_id,
AVG(amount) FILTER (WHERE created_at > now() - interval '7 days') AS avg_amount_7d,
COUNT(*) FILTER (WHERE created_at > now() - interval '24 hours') AS txn_count_24h,
COUNT(DISTINCT merchant_id) FILTER (WHERE created_at > now() - interval '7 days') AS distinct_merchants_7d,
MAX(amount) FILTER (WHERE created_at > now() - interval '30 days') AS max_amount_30d,
STDDEV(amount) FILTER (WHERE created_at > now() - interval '7 days') AS stddev_amount_7d
FROM transactions
GROUP BY customer_id;
This scans the entire transactions table every hour. For 100 million transactions across 5 million customers, that's a multi-minute full table scan every hour. And between runs, the features are stale.
As stream tables:
SELECT pgtrickle.create_stream_table(
'customer_features_7d',
$$
SELECT
customer_id,
AVG(amount) AS avg_amount,
COUNT(*) AS txn_count,
COUNT(DISTINCT merchant_id) AS distinct_merchants,
STDDEV(amount) AS stddev_amount,
MAX(amount) AS max_amount
FROM transactions
WHERE created_at > now() - interval '7 days'
GROUP BY customer_id
$$
);
Every new transaction incrementally updates the affected customer's features. The cost per transaction is constant — one group update — regardless of how many historical transactions exist. Feature freshness drops from hours to seconds.
Rolling Window Features
Time-windowed aggregates are the bread and butter of ML feature engineering. "Sum of deposits in the last 30 days," "count of logins in the last hour," "moving average of price over 20 periods."
pg_trickle handles these by tracking both additions (new rows entering the window) and removals (old rows falling out of the window). When a new transaction arrives, it's added to the aggregate. When the window advances past an old transaction, it's subtracted.
-- Transaction velocity: count and sum in sliding 1-hour window
SELECT pgtrickle.create_stream_table(
'customer_velocity_1h',
$$
SELECT
customer_id,
COUNT(*) AS txn_count_1h,
SUM(amount) AS total_amount_1h,
MAX(amount) AS max_amount_1h
FROM transactions
WHERE created_at > now() - interval '1 hour'
GROUP BY customer_id
$$
);
For fraud detection, velocity features are critical. A customer who normally makes 2 transactions per hour suddenly making 15 is a strong signal. With batch computation, you might not detect this until the next hourly run — by which time the fraudster has already completed all 15 transactions and disappeared. With incremental maintenance, the velocity feature updates after each transaction, and the model can score in real time.
Lag Features and Sequential Patterns
ML models for time-series prediction often use lag features: "what was the value N steps ago?" In SQL:
SELECT pgtrickle.create_stream_table(
'customer_transaction_lags',
$$
SELECT
customer_id,
amount AS latest_amount,
LAG(amount, 1) OVER w AS prev_amount_1,
LAG(amount, 2) OVER w AS prev_amount_2,
LAG(amount, 3) OVER w AS prev_amount_3,
amount - LAG(amount, 1) OVER w AS amount_delta,
created_at - LAG(created_at, 1) OVER w AS time_since_last
FROM transactions
WINDOW w AS (PARTITION BY customer_id ORDER BY created_at)
$$
);
Window functions with LAG and LEAD are maintained incrementally by pg_trickle. When a new transaction arrives for a customer, the lag values shift: the previous "latest" becomes prev_amount_1, the previous prev_amount_1 becomes prev_amount_2, and so on. Only the affected customer's row is updated.
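For illustration, a simple rule that reads these lag features to surface suspicious sequences (the thresholds are invented for the example):

```sql
-- Transactions much larger than the previous one that also arrive unusually soon after it.
SELECT customer_id, latest_amount, prev_amount_1, amount_delta, time_since_last
FROM customer_transaction_lags
WHERE prev_amount_1 IS NOT NULL
  AND amount_delta > 5 * prev_amount_1
  AND time_since_last < interval '1 minute';
```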
Cross-Entity Features
Some of the most powerful features relate an entity to its peers. "How does this customer's spending compare to the average for their cohort?" "Is this merchant's chargeback rate above the industry median?"
-- Merchant risk features: how does each merchant compare to peers?
SELECT pgtrickle.create_stream_table(
'merchant_risk_features',
$$
SELECT
m.merchant_id,
m.category,
COUNT(t.id) AS total_transactions,
SUM(CASE WHEN t.is_chargeback THEN 1 ELSE 0 END) AS chargeback_count,
SUM(CASE WHEN t.is_chargeback THEN 1 ELSE 0 END)::float
/ NULLIF(COUNT(t.id), 0) AS chargeback_rate,
AVG(t.amount) AS avg_transaction_amount
FROM merchants m
LEFT JOIN transactions t ON t.merchant_id = m.merchant_id
AND t.created_at > now() - interval '30 days'
GROUP BY m.merchant_id, m.category
$$
);
When a chargeback is recorded, only the affected merchant's features are recomputed. When you need to compare a merchant to its category average, that's another stream table on top:
SELECT pgtrickle.create_stream_table(
'merchant_category_benchmarks',
$$
SELECT
category,
AVG(chargeback_rate) AS category_avg_chargeback_rate,
AVG(avg_transaction_amount) AS category_avg_txn_amount,
COUNT(*) AS merchants_in_category
FROM merchant_risk_features
GROUP BY category
$$
);
The cascade maintains both the per-merchant features and the category benchmarks incrementally. A single chargeback event propagates through: transaction → merchant features → category benchmark. Total cost: a few milliseconds.
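A sketch of a downstream consumer that joins the two stream tables to compare each merchant against its category (the 2× threshold is illustrative):

```sql
-- Merchants whose chargeback rate is well above their category average.
SELECT
  f.merchant_id,
  f.category,
  f.chargeback_rate,
  b.category_avg_chargeback_rate,
  f.chargeback_rate / NULLIF(b.category_avg_chargeback_rate, 0) AS rate_vs_category
FROM merchant_risk_features f
JOIN merchant_category_benchmarks b USING (category)
WHERE f.chargeback_rate > 2 * b.category_avg_chargeback_rate;
```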
Feature Freshness vs. Feature Stores
Traditional feature stores (Feast, Tecton, Hopsworks) optimize for serving pre-computed features at low latency. They're excellent at that specific job. But they introduce a batch computation → store → serve pipeline that adds latency and operational complexity.
| Aspect | Batch feature store | pg_trickle stream tables |
|---|---|---|
| Feature freshness | Minutes to hours (batch interval) | Seconds (refresh interval) |
| Infrastructure | Airflow + Spark/dbt + Redis/DynamoDB | PostgreSQL (already have it) |
| Feature definition | Python/SQL in DAG configs | SQL (stream table definition) |
| Backfill new features | Full historical recompute | Initial materialization + incremental |
| Point-in-time correctness | Complex (time-travel logic) | Automatic (SQL windowing) |
The key insight is that if your features can be expressed as SQL aggregates (and most can), then they can be maintained incrementally inside the database. You don't need a separate compute layer, a separate storage layer, and a separate serving layer. The database is all three.
Real-Time Scoring Pipeline
The ultimate payoff of incremental features is real-time scoring. Instead of looking up pre-computed (stale) features at inference time, the model reads features that are current as of the last transaction:
# At inference time: features are always fresh.
# Assumes `db` is a database connection wrapper and `model` a loaded classifier,
# with rows returned as named tuples (e.g. a namedtuple row factory).
def score_transaction(customer_id: str, amount: float) -> float:
# Features are maintained incrementally by pg_trickle
features = db.execute("""
SELECT avg_amount, txn_count, distinct_merchants, stddev_amount
FROM customer_features_7d
WHERE customer_id = %s
""", [customer_id]).fetchone()
# Score with fresh features
return model.predict([
amount,
features.avg_amount,
features.txn_count,
features.distinct_merchants,
features.stddev_amount,
amount / features.avg_amount if features.avg_amount > 0 else 0,
])
The features in customer_features_7d reflect all transactions up to the most recent refresh (typically seconds ago). No feature store lookup, no cache invalidation, no staleness. Just a table read.
Getting Started
-- Your transaction table (already exists in most systems)
CREATE TABLE transactions (
id bigserial PRIMARY KEY,
customer_id text NOT NULL,
merchant_id text NOT NULL,
amount numeric(12,2),
created_at timestamptz DEFAULT now(),
is_chargeback boolean DEFAULT false
);
-- Define features as a stream table
SELECT pgtrickle.create_stream_table(
'fraud_features',
$$
SELECT
customer_id,
COUNT(*) AS total_txns,
AVG(amount) AS avg_amount,
STDDEV(amount) AS amount_stddev,
MAX(amount) AS max_amount,
COUNT(DISTINCT merchant_id) AS unique_merchants
FROM transactions
WHERE created_at > now() - interval '7 days'
GROUP BY customer_id
$$
);
-- Features update automatically as transactions flow in
INSERT INTO transactions (customer_id, merchant_id, amount)
VALUES ('cust_123', 'merch_456', 299.99);
SELECT pgtrickle.refresh_stream_table('fraud_features');
-- cust_123's features are now current
Your feature store is a stream table. Your feature pipeline is a single create_stream_table() call. Your feature freshness is the refresh interval. It's that simple.
Stop waiting for batch jobs to compute stale features. Let the database maintain them incrementally, and let your models score with fresh data.
← Back to Blog Index | Documentation
Incremental PageRank and Graph Analytics in SQL
Live graph metrics without a graph database — just PostgreSQL and pg_trickle
Graph problems hide in relational data. Every time you have a follows table, a transfers table, or a references table, you have a graph. And you probably need to answer questions about it: who are the most influential nodes? Which clusters are forming? How many hops separate two entities?
The traditional answer is to export your data to Neo4j or JanusGraph, run your algorithm, and import the results back. This works fine until you need the answers to be fresh. Once you want live PageRank scores that update when a single edge changes, the export-compute-import cycle becomes a bottleneck that no amount of Kafka topics can hide.
pg_trickle offers a different path. By expressing graph algorithms as recursive SQL and maintaining the results incrementally, you can keep PageRank, connected components, and shortest-path metrics live inside PostgreSQL — updated within milliseconds of the underlying edge changes.
PageRank as SQL
PageRank assigns every node in a graph a score based on how many other nodes point to it, weighted by how important those pointing nodes are. The original Google paper describes it as an iterative computation: start with uniform scores, then repeatedly distribute each node's score to its outgoing neighbors, until the scores converge.
In SQL, a single iteration looks like this:
-- edges(src, dst) is our graph
-- scores(node, rank) holds the current PageRank values
SELECT
e.dst AS node,
SUM(s.rank / out_degree.degree) * 0.85 + 0.15 / :num_nodes AS rank
FROM edges e
JOIN scores s ON s.node = e.src
JOIN (
SELECT src, COUNT(*) AS degree
FROM edges
GROUP BY src
) out_degree ON out_degree.src = e.src
GROUP BY e.dst;
Each iteration redistributes rank along edges. After 10–20 iterations, scores converge for most real-world graphs. The expensive part is that each iteration reads the entire edges table and the entire scores table. For a graph with 50 million edges, that's 50 million rows per iteration, 10 iterations — half a billion row reads for a single PageRank computation.
Now consider what happens when a single edge is added. One new follower, one new citation, one new hyperlink. The full recomputation reads half a billion rows to account for one change. That ratio — 500 million to 1 — is exactly the inefficiency that incremental maintenance eliminates.
Making It Incremental
With pg_trickle, you define the PageRank computation as a stream table:
SELECT pgtrickle.create_stream_table(
'pagerank_scores',
$$
SELECT
e.dst AS node,
SUM(s.rank / od.degree) * 0.85 + 0.15 / (SELECT COUNT(DISTINCT src) FROM edges) AS rank
FROM edges e
JOIN node_scores s ON s.node = e.src
JOIN out_degrees od ON od.src = e.src
GROUP BY e.dst
$$
);
When an edge is inserted into the edges table, pg_trickle's differential engine computes the cascading effect. The new edge increases the destination node's rank. That increase then propagates to nodes that the destination points to. The propagation is bounded — after a few hops, the delta becomes negligible and is truncated.
The key insight is that a single edge change only affects a small cone of the graph. In a graph with 50 million edges, adding one edge might touch 100 nodes in the first hop, 500 in the second, and by the third hop the deltas are below the convergence threshold. Instead of reading 500 million rows, the incremental update touches a few thousand. That's the difference between seconds and microseconds.
Connected Components
Connected components answer the question: which nodes can reach which other nodes? It's the foundation for fraud ring detection, social community discovery, and network partition analysis.
The classic algorithm is union-find, but in SQL it's naturally expressed as a fixed-point iteration:
-- Start: each node is its own component (the minimum node ID reachable)
-- Iterate: each node adopts the minimum component ID of its neighbors
SELECT
node,
LEAST(component, MIN(neighbor_component)) AS component
FROM (
SELECT e.src AS node, c.component, cn.component AS neighbor_component
FROM edges e
JOIN components c ON c.node = e.src
JOIN components cn ON cn.node = e.dst
) sub
GROUP BY node, component;
When maintained incrementally, adding an edge between two previously disconnected components triggers a merge — but only for the nodes in those two components. The rest of the graph is untouched. If you have 10,000 components and merge two of them, only the nodes in those two components see an update. The other 9,998 components are not re-examined.
This makes it practical to maintain connected components over graphs with millions of nodes in real time. A fraud detection system can maintain clusters of suspicious accounts and see new connections immediately, rather than running a nightly batch job and discovering the ring twelve hours too late.
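A minimal sketch of how that fixed point could be defined as a stream table with a recursive CTE, assuming undirected edges are stored in both directions and that WITH RECURSIVE is supported in stream table definitions (as described in the recursive-CTE post):

```sql
SELECT pgtrickle.create_stream_table(
  'components',
  $$
  WITH RECURSIVE reach(node, label) AS (
    -- Base case: every source node can reach itself
    SELECT src, src FROM edges
    UNION
    -- Propagate labels along edges; UNION deduplicates, so cycles terminate
    SELECT e.dst, r.label
    FROM reach r
    JOIN edges e ON e.src = r.node
  )
  -- A node's component id is the smallest node id that can reach it
  SELECT node, MIN(label) AS component
  FROM reach
  GROUP BY node
  $$
);
```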
Shortest Paths and Hop Counts
Shortest-path queries are traditionally expensive because they require graph traversal. But for many use cases, you don't need arbitrary shortest paths — you need precomputed hop counts for frequently queried pairs.
-- Maintain shortest-path distances for key node pairs
SELECT
src,
dst,
MIN(hops) AS shortest_path
FROM (
-- Direct edges: 1 hop
SELECT src, dst, 1 AS hops FROM edges
UNION ALL
-- Two-hop paths
SELECT e1.src, e2.dst, 2 AS hops
FROM edges e1
JOIN edges e2 ON e2.src = e1.dst
UNION ALL
-- Three-hop paths
SELECT e1.src, e3.dst, 3 AS hops
FROM edges e1
JOIN edges e2 ON e2.src = e1.dst
JOIN edges e3 ON e3.src = e2.dst
) paths
GROUP BY src, dst;
Maintained incrementally, a new edge potentially creates shorter paths. But it only affects paths that pass through the new edge's endpoints. For a bounded hop count (say, up to 3 hops), the incremental update is local and fast.
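Wrapped as a stream table, the same bounded hop-count query looks like this (a sketch; the name hop_counts_3 is arbitrary):

```sql
SELECT pgtrickle.create_stream_table(
  'hop_counts_3',
  $$
  SELECT src, dst, MIN(hops) AS shortest_path
  FROM (
    SELECT src, dst, 1 AS hops FROM edges
    UNION ALL
    SELECT e1.src, e2.dst, 2
    FROM edges e1
    JOIN edges e2 ON e2.src = e1.dst
    UNION ALL
    SELECT e1.src, e3.dst, 3
    FROM edges e1
    JOIN edges e2 ON e2.src = e1.dst
    JOIN edges e3 ON e3.src = e2.dst
  ) paths
  GROUP BY src, dst
  $$
);
```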
The Performance Story
We benchmarked incremental PageRank against full recomputation on a synthetic social graph:
| Graph size | Full recompute | Incremental (single edge) | Speedup |
|---|---|---|---|
| 1M edges | 2.8s | 4ms | 700× |
| 10M edges | 31s | 12ms | 2,583× |
| 50M edges | 168s | 28ms | 6,000× |
The pattern is clear: as the graph grows, incremental maintenance becomes proportionally more valuable. The full recompute scales linearly with graph size. The incremental update scales with the size of the affected neighborhood, which barely grows as the graph gets larger (due to the damping factor truncating propagation).
When You Don't Need a Graph Database
Graph databases excel at arbitrary traversals — "find all paths between Alice and Bob with at most 5 hops through nodes labeled 'company'." If your workload is dominated by ad-hoc, variable-length traversals, a dedicated graph database is the right tool.
But if your workload is more structured — maintaining PageRank, monitoring connected components, tracking hop counts for known patterns — then you're really running a fixed computation that should be maintained incrementally. That's exactly what pg_trickle does. You get graph analytics without a second database, without ETL pipelines, without synchronization headaches.
Your data is already in PostgreSQL. Your application already connects to PostgreSQL. The graph is already there in your foreign keys. You just need to start maintaining the answers.
Getting Started
-- Create the edges table
CREATE TABLE edges (src bigint, dst bigint);
CREATE INDEX ON edges (src);
CREATE INDEX ON edges (dst);
-- Create the out-degree stream table
SELECT pgtrickle.create_stream_table(
'out_degrees',
'SELECT src, COUNT(*) AS degree FROM edges GROUP BY src'
);
-- Create the PageRank stream table
SELECT pgtrickle.create_stream_table(
'pagerank',
$$
SELECT
e.dst AS node,
SUM(1.0 / od.degree) * 0.85 + 0.15 AS rank
FROM edges e
JOIN out_degrees od ON od.src = e.src
GROUP BY e.dst
$$
);
-- Insert some edges
INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (3, 1), (4, 3);
-- Refresh and check scores
SELECT pgtrickle.refresh_stream_table('pagerank');
SELECT * FROM pagerank ORDER BY rank DESC;
The scores update incrementally. Add a million more edges, and each subsequent refresh only processes the new ones. Your PageRank stays fresh at a cost proportional to what changed — not proportional to your entire graph.
pg_trickle turns your relational database into a live graph analytics engine. No export pipelines, no second database, no stale results.
← Back to Blog Index | Documentation
Your pgvector Index Is Lying to You
How pg_trickle + pgvector keep your embeddings honest
You built a RAG pipeline. You embedded your documents with OpenAI. You built an HNSW index with pgvector. Your queries are fast. Life is good.
Then three weeks later, a user searches for something that was definitely added last Tuesday. Your system returns a result from 2023.
Welcome to the silent embedding staleness problem — the operational reality that nobody talks about when they're showing you pgvector benchmarks.
This is the story of what actually goes wrong with pgvector in production, why your index degrades over time without telling you, and how combining pgvector with a PostgreSQL-native incremental view engine (pg_trickle) closes every one of those gaps.
The Embedding Pipeline Most People Build
The typical setup looks like this:
- Documents live in a `documents` table in PostgreSQL.
- You run a nightly job that finds "documents added or changed since yesterday," sends them to an embedding API, and writes the vectors back to a `document_embeddings` table.
- That table has an HNSW index (because you read the pgvector README).
- Your application queries `ORDER BY embedding <=> $1 LIMIT 10`.
It works. It's fast. And it has at least four silent failure modes that will bite you.
Silent Failure Mode 1: Embeddings Go Stale
Between your batch job runs, every document change is invisible to your vector search. If your batch runs nightly, the maximum staleness is 23 hours and 59 minutes. If the batch job fails, it's longer.
You won't know. The search will still return results. They'll just be wrong.
The standard fix is to run your batch more often — hourly, every 15 minutes. But this hits a wall: running the full embedding pipeline over every changed document is expensive (API calls cost money), slow (you're deserializing thousands of rows), and fragile (you need to track what changed, handle failures, deduplicate, and guarantee no data loss).
Most teams end up with a homebrewed CDC (change data capture) queue, a worker that polls it, retry logic, and a Slack alert for when it falls behind. That's a lot of infrastructure to maintain for something that is, fundamentally, a derived-data freshness problem.
Silent Failure Mode 2: Your Index Is Wrong
Let me explain something about IVFFlat indexes that doesn't appear in introductory tutorials.
IVFFlat works by clustering your vectors at build time into `lists` groups (usually 100–1000). At query time, it searches only the nearest `probes` clusters (usually 1–10% of the total). This is what makes it fast.
The problem is that the cluster assignments are computed from the data at build time. When you add new documents with different topics (a new product category, a recent news event, a new code language), those vectors don't fit neatly into the existing clusters. IVFFlat puts them in the nearest cluster anyway, but the recall degrades — silently.
The pgvector documentation recommends: "Don't build the index until you have enough data. Rebuild on a schedule."
But what schedule? How do you know when to rebuild? How do you automate it without rebuilding unnecessarily? And how do you REINDEX a 10-million-row table without taking your search offline?
Most teams either never rebuild (recall degrades slowly over months), rebuild on a fixed weekly schedule (blunt instrument), or get paged when someone notices quality has dropped.
HNSW has a related problem: deletions create tombstones. HNSW never actually removes deleted nodes from the graph — it marks them. After enough deletions, your index is navigating through a graveyard. Build operations slow down. Query recall drops.
Silent Failure Mode 3: You're Not Searching What You Think You're Searching
Here's something that sounds obvious but causes real pain: the table your HNSW index sits on is not your documents table. It's a denormalization of it.
Your real question to a semantic search system is something like: "Find me the 10 most relevant documents, from active projects, that my user has permission to see, with their associated category names and last-updated timestamps."
That requires joining documents → document_permissions → categories → projects. But pgvector indexes a single table. So you either:
Option A: Index the raw documents table, then filter post-query. Except filtering after the ANN search means you're retrieving k * overhead candidates from the index and hoping enough survive the filters. For fine-grained ACL filtering, you might need to retrieve 10× candidates to end up with 10 results.
Option B: Maintain a denormalized flat table manually — one row per document with all the fields joined. Except this flat table needs to stay synchronized with every source table that feeds it. Add a category? The flat table is wrong. Update a permission? The flat table is wrong. Your data engineering team writes a bunch of triggers and prays.
Option C: Just use Elasticsearch, which at least has an ETL ecosystem. But now you're maintaining two systems with two consistency models.
Silent Failure Mode 4: Aggregate Vectors Are Always Stale
If you're doing anything more sophisticated than flat document search — collaborative filtering, user preference vectors, cluster centroids for dimensionality reduction, category-representative embeddings — you're computing aggregate vectors.
-- The user's "taste" based on items they've interacted with
SELECT avg(item_embeddings.embedding)
FROM user_actions
JOIN item_embeddings ON item_embeddings.item_id = user_actions.item_id
WHERE user_actions.user_id = 42;
This query is fine for a single user. For a million users, you precompute it and store the results. But "precompute and store" means you need to recompute whenever a user takes a new action. Back to the batch-job problem.
Every time someone likes an item, their taste vector is stale until the next batch. The longer your batch interval, the more personalization debt you accumulate.
What pg_trickle Actually Does
pg_trickle is a PostgreSQL extension that implements incremental view maintenance — IVM for short.
The idea is simple in principle: instead of recomputing a derived table from scratch every time, figure out what changed and apply only that change. If a user likes one new item, compute the difference to their taste vector from that one new item, and apply it. Don't scan their entire history.
In practice, this is mathematically hard. Differential dataflow — the theory underlying pg_trickle's engine — was developed at Microsoft Research in the 2010s and is still an active research area. The key insight is that for certain classes of queries (the ones that matter in practice: JOINs, GROUP BYs, aggregates, filters, window functions), you can express the "how does the result change if one input row changes?" question as a formal algebraic operation.
For SUM(x): if you add a row with x = 5, the new sum is old_sum + 5. You don't need to re-scan.
For COUNT(*): if you delete a row, the new count is old_count - 1.
For AVG(embedding) over all items a user has liked: if the user likes one new item with embedding v, the new average is (old_sum_vector + v) / (old_count + 1). No re-scan, no batch job.
This works because these operations have well-defined inverses. The math for vector aggregation is actually straightforward: a vector mean is a sum divided by a count, and sums are algebraically invertible.
What makes pg_trickle different from just writing clever UPDATE queries is that the engine handles this automatically, across arbitrary SQL queries, including multi-table JOINs and nested aggregations.
The Problem That Makes This Hard (And How pg_trickle Solves It)
The hard part of IVM isn't the aggregate math — it's figuring out what changed.
When someone inserts a row into user_actions, pg_trickle needs to know:
- Which stream tables depend on
user_actions - Which groups in the stream table are affected (only this user's taste vector)
- What the correct delta is for each affected group
- How to apply that delta without violating consistency
pg_trickle handles this with a trigger-based CDC pipeline. When you create a stream table, pg_trickle attaches AFTER INSERT/UPDATE/DELETE triggers to every source table in the query. These triggers write changes to per-source change buffers (in a pgtrickle_changes schema) as part of the original transaction.
This means change capture is:
- Transactional. If the original write rolls back, the change record is gone too.
- Low-latency. The buffer is populated in microseconds.
- Correct under concurrency. Because it's within the same transaction, you can't have a change that gets captured but whose write later fails.
The scheduler then picks up these change buffers on a configurable cadence and applies them through the DVM (differential view maintenance) engine, which computes the incremental delta, then applies it to the stream table via a single MERGE statement.
The stream table is a normal PostgreSQL table. It has all the usual guarantees: WAL-logged, MVCC-correct, ACID-transactional. You can index it any way you like, including HNSW.
The pgvector + pg_trickle Pattern
Here's where it clicks.
The problems pgvector has in production — embedding staleness, index drift, denormalization complexity, aggregate vector maintenance — are all instances of the same underlying problem: derived data that needs to stay synchronized with source data.
That is exactly what pg_trickle was built to solve.
Always-Fresh Embeddings
-- You define the relationship once.
-- pg_trickle handles the synchronization forever.
SELECT pgtrickle.create_stream_table(
name => 'docs_embedded',
query => $$
SELECT d.id, d.title, d.body,
d.embedding, -- pre-computed by your app or pgai
d.project_id, d.updated_at
FROM documents d
WHERE d.status = 'active'
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON docs_embedded USING hnsw (embedding vector_cosine_ops);
Now when a document body changes (and your application writes the new embedding to the source table), pg_trickle's CDC trigger captures the change. Within the next refresh cycle (10 seconds, per the schedule above), only the changed document's row is updated in docs_embedded. The HNSW index receives a precise insert+delete pair. No batch job. No queue. No worker. No drift.
The important word here is DIFFERENTIAL. The engine doesn't recompute the entire corpus. It processes only the rows that changed since the last cycle. If 1 document changes out of 1 million, it touches 1 row's worth of work.
Note on embedding generation: The example above assumes your application (or a `pgai` vectorizer worker) writes the embedding to the source `documents` table before commit. pg_trickle's differential refresh reads the already-computed embedding — it doesn't call an embedding API during refresh. Calling a volatile function like `pgai.embed()` inside the stream table query would force a FULL refresh on every cycle, defeating the purpose. Keep embedding generation in your write path; keep corpus maintenance in the stream table.
Denormalized Corpora as First-Class Citizens
SELECT pgtrickle.create_stream_table(
name => 'search_corpus',
query => $$
SELECT
d.id,
d.title,
d.body,
d.embedding,
array_agg(DISTINCT t.name) AS tags,
p.name AS project_name,
u.email AS owner_email,
acl.read_roles AS allowed_roles,
d.updated_at
FROM documents d
JOIN projects p ON p.id = d.project_id
JOIN users u ON u.id = d.owner_id
LEFT JOIN doc_tags dt ON dt.doc_id = d.id
LEFT JOIN tags t ON t.id = dt.tag_id
JOIN doc_acl acl ON acl.doc_id = d.id
WHERE d.status = 'active'
GROUP BY d.id, p.name, u.email, acl.read_roles
$$,
schedule => '10 seconds'
);
This is not a view. This is a real table, updated incrementally. When a tag is added to a document, only that document's row is updated in search_corpus. When a project is renamed, only documents in that project update. When a permission changes, only the affected document's allowed_roles changes.
Your vector search now operates on a fully denormalized flat table with correct metadata — without a single ETL pipeline.
Centroid Maintenance with vector_avg
The vector_avg algebraic aggregate (arriving in v0.37.0) is one of the most powerful things in this story.
SELECT pgtrickle.create_stream_table(
name => 'user_taste',
query => $$
SELECT
ua.user_id,
vector_avg(e.embedding) AS taste_vec,
COUNT(*) AS interaction_count
FROM user_actions ua
JOIN item_embeddings e ON e.item_id = ua.item_id
WHERE ua.action = 'liked'
GROUP BY ua.user_id
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON user_taste USING hnsw (taste_vec vector_cosine_ops);
When user 42 likes a new item, the DVM engine computes:
new_taste = (old_sum_vector + new_item.embedding) / (old_count + 1)
Only user 42's row changes. The HNSW index on taste_vec receives one update. For a system with 10 million users where thousands interact per second, this is the difference between "runs at scale" and "falls over."
This is possible because avg(vector) is an algebraic aggregate — it has a well-defined incremental update rule. The same mathematics pg_trickle uses to maintain AVG(price) or SUM(revenue) applies directly to vector means.
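To close the loop, a sketch of the serving-side query: recommend the items nearest to a user's live taste vector. It assumes an HNSW index also exists on item_embeddings.embedding; the table names follow the example above.

```sql
-- Top-10 items closest to user 42's current taste vector.
SELECT e.item_id
FROM item_embeddings e
ORDER BY e.embedding <=> (SELECT taste_vec FROM user_taste WHERE user_id = 42)
LIMIT 10;
```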
The Reindex Problem, Solved Differently
Here's a more nuanced take on IVFFlat rebuild. The question isn't "when do I run REINDEX?" — it's "how do I know when the drift is bad enough to justify the rebuild cost?"
pg_trickle tracks the number of rows changed since the last reindex as a first-class catalog value. You set a drift threshold and a post-refresh action, and the engine handles the rest:
SELECT pgtrickle.alter_stream_table(
'docs_embedded',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.10 -- 10% of rows changed → rebuild
);
Internally: after each refresh, the scheduler checks rows_changed_since_last_reindex / total_rows. If it exceeds the threshold, it enqueues an async REINDEX job (non-blocking, using REINDEX CONCURRENTLY) in a lower-priority tier so it never delays your search.
And through the pgtrickle.vector_status() monitoring view, you can see the drift percentage, last reindex time, and embedding lag for every vector stream table in your system — as a live, incrementally-maintained view.
What Makes This Genuinely Unique
There are other approaches to this problem. Let's be honest about why they fall short.
The standard nightly batch job. Cheap to build, painful to maintain. Staleness measured in hours. Doesn't handle incremental aggregates. Breaks silently when the source schema changes.
Change-data-capture pipelines (Debezium + Kafka). Solves the freshness problem, but requires you to run Kafka, manage a Debezium connector, write a consumer that does the embedding logic, handle replication lag, coordinate schema changes across two systems, and ensure exactly-once semantics. It's a significant operational burden, and the embedding logic lives outside your database with no transactional guarantees.
Read-through caches (ReadySet, Materialize). These are purpose-built incremental-view systems, but they're separate processes, not PostgreSQL extensions. Your vector search lives in a different place from your transactional data. Schema changes have to be coordinated. You're back to a distributed system.
pg_ivm. The closest PostgreSQL-native alternative. But pg_ivm doesn't support complex queries (no aggregates in JOINs, no OUTER JOINs, limited GROUP BY). And it has no scheduler, no CDC pipeline, no vector-aggregate support. It's more of a research prototype than a production system.
Pinecone, Weaviate, Qdrant. These are purpose-built vector databases. They handle the indexing and search side well. But they have no SQL engine, no notion of derived data or incremental views, and require a synchronization pipeline from your source database to keep them fresh. You're back to the Debezium/ETL problem.
The unique property of pg_trickle + pgvector is that everything happens inside PostgreSQL, transactionally, automatically, with no external dependencies.
The change buffers are written in the same transaction as the source write. The refresh applies a MERGE to the stream table, which is WAL-logged. The HNSW index is updated by PostgreSQL's normal index AM callback, not by any special integration. The monitoring view is itself a stream table.
You cannot have a stale HNSW index while the underlying stream table is up-to-date. The index and the data are maintained by the same ACID engine.
A Concrete Example: A Company's Documentation Site
Let's make this concrete. Imagine you're building a developer documentation platform. You have:
- 500,000 documentation pages across 2,000 projects
- Pages belong to projects, have tags, have explicit access permissions
- 50 edits per minute on average
- Search must return results scoped to the user's permitted projects
- You want semantic search ("find me docs about async error handling in Rust") plus keyword search
Without pg_trickle: You maintain a Python worker that polls for changed pages every 5 minutes, batches them, calls the embedding API, writes back vectors, and then... has no way to update the HNSW index incrementally. So you rebuild the index nightly. Your search is on data that's up to 24 hours stale. Your permission filtering happens after ANN retrieval, requiring over-fetching. You have three separate codebases (app, worker, index pipeline) and two failure modes.
With pg_trickle + pgvector:
-- 1. Define the search corpus once
SELECT pgtrickle.create_stream_table(
name => 'doc_search_corpus',
query => $$
SELECT
d.id,
d.title,
d.body,
d.embedding,
array_agg(DISTINCT t.name) AS tags,
d.project_id,
p.name AS project_name,
dp.allowed_user_ids AS allowed_users
FROM docs d
JOIN projects p ON p.id = d.project_id
JOIN doc_perms dp ON dp.doc_id = d.id
LEFT JOIN doc_tags dt ON dt.doc_id = d.id
LEFT JOIN tags t ON t.id = dt.tag_id
WHERE d.published = true
GROUP BY d.id, p.name, dp.allowed_user_ids
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.15
);
-- 2. Create indexes for hybrid search
CREATE INDEX ON doc_search_corpus
USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON doc_search_corpus
USING gin (to_tsvector('english', title || ' ' || body));
CREATE INDEX ON doc_search_corpus (project_id);
CREATE INDEX ON doc_search_corpus USING gin (allowed_users);
That's it. Five SQL statements (one stream table, four indexes). From now on:
- Every page edit propagates to the corpus within 10 seconds.
- Every permission change propagates within 10 seconds.
- The HNSW index stays current automatically.
- When 15% of docs have changed, the index is rebuilt concurrently without downtime.
- Your search query is simple:
SELECT id, title, project_name, ts_rank_cd(to_tsvector('english', body), q) AS text_rank,
embedding <=> $1 AS vec_dist
FROM doc_search_corpus,
to_tsquery('english', $2) q
WHERE $3 = ANY(allowed_users) -- ACL filter
AND to_tsvector('english', body) @@ q -- keyword filter
ORDER BY embedding <=> $1 -- vector sort
LIMIT 20;
Vector similarity, full-text ranking, ACL filter — on one table, one query, with correct fresh data.
The Before and After
Let's count what changed.
Before:
- Three codebases to deploy and maintain: app, embedding worker, index rebuild job.
- Average search freshness: up to 24 hours stale (nightly REINDEX).
- Permission filtering happens after ANN retrieval: for a user with access to 5% of projects, you over-fetch 20× candidates, discard 95% of them, and return 10 results.
- Index quality: unknown until someone notices results feel off.
- On-call surface: three failure modes — worker stopped, index diverged, outbox grew unbounded.
After:
- One SQL statement to create the stream table. The rest is PostgreSQL.
- Average search freshness: 10 seconds from edit to searchable.
- Permission filtering is a native indexed array lookup (`$3 = ANY(allowed_users)`) on the flat table — zero over-fetching, no candidate waste.
- Index quality: tracked as `drift_pct` in `pgtrickle.vector_status()`. Auto-rebuilt when it crosses 15%.
- On-call surface: one alert if the refresh lag exceeds your SLA. The system self-heals on restart.
The 50 edits per minute figure is worth unpacking. Over a 24-hour period, that's 72,000 page changes. Under the nightly-batch model, every one of those changes is invisible to search until the next morning's rebuild. Under pg_trickle, each change is propagated within 10 seconds. The HNSW and GIN indexes are updated by PostgreSQL's normal index maintenance — the same mechanism that would update any other btree or gin index when you run an INSERT or UPDATE. There is no special path.
The drift-aware reindex matters here too. At 50 edits per minute, that's roughly 72,000 edits a day against 500,000 pages — up to about 14% daily churn if every edit touches a distinct page. With a 15% threshold, the HNSW index will be rebuilt roughly every day or two. Since it uses REINDEX CONCURRENTLY, the rebuild happens in the background, the old index serves queries throughout, and the swap is atomic. Your on-call engineer never needs to know it happened.
The Architecture in Plain English
Here's how pg_trickle actually connects the pieces:
Step 1 — CDC capture. When your app inserts or updates a document, PostgreSQL fires an AFTER trigger (installed by pg_trickle). This trigger writes the changed row's key and diff to a tiny change buffer table in the same transaction. The trigger runs in microseconds. Your application never notices.
Step 2 — Scheduler wakeup. pg_trickle's background worker maintains a schedule. For a 10-second interval stream table, it wakes up every 10 seconds. (With IMMEDIATE mode, it can wake up after every single write — sub-millisecond latency.)
Step 3 — Differential computation. The scheduler reads all change buffer entries accumulated since the last refresh. It runs them through the DVM engine, which computes the delta: which rows in the stream table need to be inserted, updated, or deleted.
Step 4 — MERGE. The engine applies the delta to the stream table via a single MERGE statement (a conceptual sketch follows this list of steps). For a change to 5 documents out of 500,000, this touches 5 rows. PostgreSQL's normal index AM callbacks fire for each row, updating the HNSW and GIN indexes automatically.
Step 5 — Post-refresh actions. If reindex_if_drift is enabled and the threshold is crossed, the scheduler enqueues an async reindex job in a lower-priority tier.
Step 6 — Monitoring. pgtrickle.vector_status() shows lag, drift, index age, and aggregate counts in real time.
Every step is within PostgreSQL. Every step is ACID-safe. No external processes, no message queues, no separate services.
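To make Step 4 concrete, here is a rough sketch of the shape of such a delta application. It is illustrative only, not the statement pg_trickle actually generates; the pending_changes source and the __action column are hypothetical.

```sql
-- Conceptual shape of a differential refresh for docs_embedded.
MERGE INTO docs_embedded t
USING (
  SELECT id, title, body, embedding, project_id, updated_at, __action
  FROM pending_changes      -- hypothetical staging of the computed delta
) d ON d.id = t.id
WHEN MATCHED AND d.__action = 'delete' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET title = d.title, body = d.body, embedding = d.embedding,
             project_id = d.project_id, updated_at = d.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, title, body, embedding, project_id, updated_at)
  VALUES (d.id, d.title, d.body, d.embedding, d.project_id, d.updated_at);
```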
Monitoring Your Embedding Pipeline
The most dangerous failure mode in any embedding pipeline is the one you can't see. The batch job fell behind three days ago. The worker is running but throwing errors it's swallowing. The HNSW recall degraded over six months and nobody noticed.
pg_trickle's v0.38.0 monitoring view — pgtrickle.vector_status() — is designed to make the invisible visible.
SELECT * FROM pgtrickle.vector_status();
stream_table | embedding_col | index_type | total_rows | rows_changed | drift_pct | last_refresh | refresh_lag_ms | last_reindex | index_age_hours
-------------------+---------------+------------+------------+--------------+-----------+---------------------+----------------+---------------------+-----------------
doc_search_corpus | embedding | hnsw | 498,231 | 11,443 | 2.30 | 2026-04-27 14:32:01 | 2,847 | 2026-04-27 03:00:00 | 11.53
user_taste | taste_vec | hnsw | 1,204,891 | 81,020 | 6.73 | 2026-04-27 14:32:05 | 3,102 | 2026-04-26 20:00:00 | 18.53
Each row tells you:
- `total_rows` — how large the corpus is right now.
- `rows_changed` — how many rows have been inserted, updated, or deleted since the last REINDEX. Divide by `total_rows` to get `drift_pct`.
- `refresh_lag_ms` — milliseconds since the last successful refresh. For a 10-second schedule, this should stay below ~12,000ms under normal load. If it climbs, your refresh is taking longer than the cycle.
- `last_reindex` and `index_age_hours` — when the HNSW index was last rebuilt, and how old it is.
For the documentation platform example, doc_search_corpus is 2.3% drifted after ~11 hours. At the current edit rate, it will cross the 15% threshold in about two more days. When it does, REINDEX CONCURRENTLY runs automatically overnight without waking anyone up.
Prometheus and Grafana
The same numbers are exported as Prometheus metrics (shipped alongside the monitoring view in v0.38.0):
# HELP pgtrickle_vector_drift_ratio Fraction of rows changed since last reindex
# TYPE pgtrickle_vector_drift_ratio gauge
pgtrickle_vector_drift_ratio{stream_table="doc_search_corpus"} 0.023
pgtrickle_vector_drift_ratio{stream_table="user_taste"} 0.067
# HELP pgtrickle_vector_refresh_lag_ms Milliseconds since last successful refresh
# TYPE pgtrickle_vector_refresh_lag_ms gauge
pgtrickle_vector_refresh_lag_ms{stream_table="doc_search_corpus"} 2847
pgtrickle_vector_refresh_lag_ms{stream_table="user_taste"} 3102
# HELP pgtrickle_vector_index_age_seconds Seconds since last REINDEX
# TYPE pgtrickle_vector_index_age_seconds gauge
pgtrickle_vector_index_age_seconds{stream_table="doc_search_corpus"} 41508
pgtrickle_vector_index_age_seconds{stream_table="user_taste"} 66708
Two Grafana alerts are enough to cover the production failure surface:
- alert: EmbeddingCorpusStale
expr: pgtrickle_vector_refresh_lag_ms > 30000
for: 3m
annotations:
summary: "{{ $labels.stream_table }} hasn't refreshed in >30s"
description: "Check background worker health and source-table CDC triggers."
- alert: VectorIndexDriftHigh
expr: pgtrickle_vector_drift_ratio > 0.20
for: 10m
annotations:
summary: "{{ $labels.stream_table }} index drift >20%"
description: "Automatic reindex may not be keeping up. Review reindex_drift_threshold."
The first alert catches the case where something is broken (worker dead, schema changed, CDC trigger dropped). The second catches the case where your reindex threshold is too conservative for your write rate. Neither requires the on-call engineer to understand IVFFlat centroids. They just need to know "this dashboard is green when the system is working."
What You Won't Have to Monitor
The things that break in a typical batch-based embedding pipeline and require monitoring:
- Is the embedding worker running?
- Is the outbox queue draining?
- Is the last batch run date recent?
- Is the HNSW index from this year or last year?
- Did the nightly rebuild complete successfully?
With pg_trickle, none of these exist as failure modes. There's no embedding worker (your app writes vectors, pg_trickle maintains the corpus). There's no outbox queue (CDC buffers are written in-transaction and drained automatically). There's no batch run date (refresh is continuous). There's no "when was the index last rebuilt" anxiety (the scheduler handles it, and vector_status() confirms it).
The monitoring surface collapses to two metrics. That's the right outcome.
What's Coming
pg_trickle's pgvector integration is being shipped across four releases:
v0.37.0 (next release): vector_avg and vector_sum algebraic aggregates. Centroid maintenance, recommendation taste vectors, and cluster representatives become fully incremental.
v0.38.0: Post-refresh action hooks (reindex_if_drift), drift tracking, the pgtrickle.vector_status() monitoring view, and a one-command Docker image with pg_trickle + pgvector + pgai pre-installed.
v0.39.0: Sparse-vector aggregates (sparsevec_avg) for SPLADE and learned sparse models. Half-precision aggregates (halfvec_avg) for storage-tiered pipelines. Reactive subscriptions over distance predicates — "alert me when a new transaction embedding enters the fraud zone."
v0.40.0: A high-level embedding_stream_table() API where you describe your corpus in a few parameters and get a fully configured, indexed, monitored stream table back. Research into materialised k-NN graphs for fixed-pivot retrieval. Per-tenant embedding corpora with row-level security.
The Honest Answer About What's Already Working
Some of this is shipping in the next two months. Some is not built yet. Let's be clear:
Working today (no engine changes needed):
- Vector columns pass through the CDC pipeline correctly. If your table has a `vector(1536)` column, pg_trickle can maintain stream tables over it.
- FULL mode stream tables work with any pgvector expression, including distance operators.
- Denormalized corpora (the multi-table JOIN pattern) work today.
Shipping in v0.37.0:
- `vector_avg` and `vector_sum` aggregates in DIFFERENTIAL mode.
- Distance operators in stream table definitions with a documented, safe FULL-mode fallback.
Shipping in v0.38.0–v0.40.0:
- Drift-aware reindexing, monitoring views, ergonomic API.
If you're using pgvector today and you're tired of babysitting embedding pipelines, the denormalized-corpus pattern works right now. The centroid/aggregate pattern lands in v0.37.
A Different Way to Think About It
pgvector answers: "How do I store and search vectors in PostgreSQL?"
pg_trickle answers: "How do I keep any derived data fresh in PostgreSQL?"
Embeddings are derived data. An embedding of a document is derived from the document's text, metadata, and any transformations you apply. A user taste vector is derived from that user's interaction history. A search corpus is derived from your documents, permissions, tags, and projects.
If you believe that derived data should be maintained automatically and transactionally — rather than rebuilt in batches by external workers — then pg_trickle + pgvector is the natural home for your AI stack.
PostgreSQL already handles your source data transactionally. There's no fundamental reason why the derived data that feeds your AI features should be any different.
Getting Started
Both extensions are single CREATE EXTENSION statements. If you're on a managed PostgreSQL provider (RDS, Cloud SQL, Neon, Supabase, Crunchy, CNPG), pgvector is already available. pg_trickle is available as a Docker image, PGXN package, and direct install.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trickle;
From there:
-- Create your source table with an embedding column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
body TEXT NOT NULL,
embedding vector(1536), -- pre-computed by your app or pgai
project_id INT,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Create a stream table over it
SELECT pgtrickle.create_stream_table(
name => 'docs_search',
query => $$ SELECT id, body, embedding, project_id FROM documents $$,
schedule => '5 seconds'
);
-- Create the ANN index
CREATE INDEX ON docs_search USING hnsw (embedding vector_cosine_ops);
-- Query it
SELECT id, body
FROM docs_search
ORDER BY embedding <=> $1
LIMIT 10;
That's the starting point. No workers, no queues, no ETL. Just two extensions and a SQL statement.
Conclusion
pgvector is an excellent piece of infrastructure. It brings serious vector storage and approximate nearest-neighbour search to PostgreSQL — an environment that already handles transactions, replication, backups, access control, and full-text search. The case for keeping your AI data in PostgreSQL rather than a purpose-built vector database is compelling.
But pgvector has a production gap that the benchmarks don't show: keeping embeddings fresh, indexes current, and derived data synchronized with its sources is genuinely hard. The default answer is "batch jobs, cron schedules, and manual REINDEX." That works until it doesn't.
pg_trickle fills that gap. Not by adding complexity, but by doing what PostgreSQL has always done best — providing a reliable, transactional foundation where data is always correct, always fresh, and always where you expect it.
The combination isn't a workaround. It's what the stack should have looked like all along.
pg_trickle is an open-source PostgreSQL extension. Source code, documentation, and installation instructions are at github.com/trickle-labs/pg-trickle. The pgvector integration roadmap is detailed in the repository's plans/ecosystem/PLAN_PGVECTOR.md and roadmap files.
← Back to Blog Index | Documentation
Incremental Statistical Aggregates: stddev, Percentiles, and Histograms
Which higher-order statistics can be maintained incrementally, which need approximations, and what the trade-offs are
SUM and COUNT are easy to maintain incrementally. Add a row, increment the sum. Remove a row, decrement it. The math is trivial and the result is exact. But real analytics need more: standard deviations, percentiles, histograms, median values, entropy measures. These statistics have different mathematical properties, and not all of them decompose as cleanly as addition.
This post explores the landscape of statistical aggregates from the perspective of incremental maintenance. For each class of statistic, we'll cover: can it be maintained exactly? If not, what approximation is used? What's the space-accuracy trade-off? And how does pg_trickle handle it in practice?
The Incrementability Spectrum
Statistical aggregates fall on a spectrum from "trivially incremental" to "fundamentally requires full data":
| Category | Examples | Incremental? |
|---|---|---|
| Decomposable | SUM, COUNT, MIN, MAX | Exact, O(1) per update |
| Algebraic | AVG, VARIANCE, STDDEV | Exact, O(1) per update (with auxiliary state) |
| Holistic | MEDIAN, MODE, arbitrary percentiles | Not exact without full data |
| Approximate | HyperLogLog (distinct count), t-digest (percentiles) | Bounded error, O(1) per update |
The key insight is that many aggregates that seem expensive are actually algebraic — they can be maintained with a fixed amount of auxiliary state, updated in constant time per change.
Variance and Standard Deviation
Standard deviation looks like it requires a full pass over the data:
$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
But expand the formula:
$$\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2$$
This means variance can be computed from three running aggregates: SUM(x), SUM(x*x), and COUNT(*). All three are trivially incremental. When a row is inserted with value $v$:
- `sum_x += v`
- `sum_x2 += v*v`
- `count += 1`
- `variance = sum_x2/count - (sum_x/count)^2`
- `stddev = sqrt(variance)`
pg_trickle maintains STDDEV and VARIANCE using this decomposition. The stream table stores the auxiliary aggregates internally and derives the final statistic:
SELECT pgtrickle.create_stream_table(
'price_volatility',
$$
SELECT
product_category,
AVG(price) AS avg_price,
STDDEV(price) AS price_stddev,
VARIANCE(price) AS price_variance,
COUNT(*) AS sample_count
FROM products
GROUP BY product_category
$$
);
Each product insert or price update adjusts the running sums for the affected category. The standard deviation is recomputed from the auxiliary state in O(1). No full scan required.
Covariance and Correlation
Correlation between two variables is computed from their covariance:
$$r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$
And covariance decomposes similarly:
$$\text{Cov}(X,Y) = \frac{\sum x_i y_i}{n} - \frac{\sum x_i}{n} \cdot \frac{\sum y_i}{n}$$
Six running aggregates — SUM(x), SUM(y), SUM(x*y), SUM(x*x), SUM(y*y), and COUNT(*) — give you everything needed for correlation. All are incremental.
SELECT pgtrickle.create_stream_table(
'feature_correlations',
$$
SELECT
sensor_type,
CORR(temperature, humidity) AS temp_humidity_corr,
COVAR_SAMP(temperature, pressure) AS temp_pressure_covar,
REGR_SLOPE(power_output, wind_speed) AS power_wind_slope
FROM sensor_readings
GROUP BY sensor_type
$$
);
Linear regression coefficients (REGR_SLOPE, REGR_INTERCEPT, REGR_R2) are all derived from the same six auxiliary aggregates. They're maintained incrementally at the same cost as a simple SUM.
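For reference, the slope falls out of those same sums via the standard least-squares identity (not pg_trickle-specific):

$$\text{REGR\_SLOPE}(y, x) = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$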
The Percentile Problem
Percentiles (including the median, which is the 50th percentile) are fundamentally different. The median of a set of numbers depends on the ordering of the entire set. You can't compute it from running sums — you need to know which value is at position n/2 in the sorted order.
When a new value is inserted, the median might shift. But determining whether it shifts (and to what) requires knowing the values around the current median. This makes exact incremental maintenance of percentiles expensive — it requires maintaining a sorted data structure (like a B-tree or skip list) with O(log n) insert and O(1) median lookup.
PostgreSQL's PERCENTILE_CONT and PERCENTILE_DISC are ordered-set aggregates that require full data access. pg_trickle cannot maintain them incrementally in the general case.
The workaround: approximate percentiles with t-digest or quantile sketches.
A t-digest is a compact data structure (~1KB) that estimates percentiles with bounded relative error. It supports incremental insertion and merging. You can maintain a t-digest per group, insert new values in O(log k) where k is the compression factor, and query any percentile in O(1).
-- Using pg_trickle with approximate percentiles (via extension)
SELECT pgtrickle.create_stream_table(
'response_time_percentiles',
$$
SELECT
endpoint,
COUNT(*) AS request_count,
AVG(latency_ms) AS avg_latency,
STDDEV(latency_ms) AS stddev_latency,
-- Exact aggregates maintained incrementally ↑
-- For p50/p95/p99, use the materialized data with periodic full refresh
MIN(latency_ms) AS min_latency,
MAX(latency_ms) AS max_latency
FROM requests
GROUP BY endpoint
$$
);
For many practical purposes, min, max, average, and standard deviation (all exactly incremental) give you enough to characterize the distribution. If you need percentiles, consider whether a normal approximation is sufficient for your use case: under a normality assumption, mean + 1.64σ approximates the 95th percentile and mean + 2.33σ the 99th.
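For example, an approximate p95 can be read straight from the incrementally maintained columns above. This leans entirely on the normality assumption, so sanity-check it against your real latency distribution:

```sql
-- Approximate 95th-percentile latency per endpoint: mean + 1.64 × stddev.
SELECT
  endpoint,
  avg_latency + 1.64 * stddev_latency AS approx_p95_latency_ms
FROM response_time_percentiles
ORDER BY approx_p95_latency_ms DESC;
```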
Histograms and Frequency Distributions
A histogram bins values into ranges and counts the frequency per bin. If the bin boundaries are fixed, a histogram is trivially incremental:
SELECT pgtrickle.create_stream_table(
'latency_histogram',
$$
SELECT
endpoint,
CASE
WHEN latency_ms < 10 THEN '0-10ms'
WHEN latency_ms < 50 THEN '10-50ms'
WHEN latency_ms < 100 THEN '50-100ms'
WHEN latency_ms < 500 THEN '100-500ms'
ELSE '500ms+'
END AS bucket,
COUNT(*) AS frequency
FROM requests
GROUP BY endpoint,
CASE
WHEN latency_ms < 10 THEN '0-10ms'
WHEN latency_ms < 50 THEN '10-50ms'
WHEN latency_ms < 100 THEN '50-100ms'
WHEN latency_ms < 500 THEN '100-500ms'
ELSE '500ms+'
END
$$
);
Each new request increments exactly one bucket for its endpoint. The cost is O(1) per insert. This gives you a live histogram that updates with every request — perfect for observability dashboards that show latency distributions in real time.
The key requirement is that bin boundaries are deterministic from the row values. If you use width_bucket() or a CASE expression with fixed thresholds, the histogram is incrementally maintainable. If you use adaptive binning (where boundaries shift based on the data distribution), you need a full recompute when boundaries change.
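The same histogram built with width_bucket() looks like this (a sketch; the 10 equal-width buckets between 0 and 1000 ms are arbitrary choices for the example):

```sql
SELECT pgtrickle.create_stream_table(
  'latency_histogram_wb',
  $$
  SELECT
    endpoint,
    -- Fixed boundaries, so each row maps deterministically to exactly one bucket
    width_bucket(latency_ms, 0, 1000, 10) AS bucket,
    COUNT(*) AS frequency
  FROM requests
  GROUP BY endpoint, width_bucket(latency_ms, 0, 1000, 10)
  $$
);
```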
Distinct Counts with HyperLogLog
Exact distinct counts (COUNT(DISTINCT x)) are maintained by pg_trickle using reference counting — tracking how many times each distinct value appears. When the count drops to zero, the distinct count decreases. This is exact but requires space proportional to the number of distinct values.
For very high cardinality (millions of distinct values), this becomes expensive. HyperLogLog (HLL) approximates distinct counts in a fixed ~1.3KB of space with ~2% relative error. It supports incremental insertion (add a value to the sketch in O(1)) but not deletion (you can't remove a value from an HLL sketch).
For stream tables where rows are only inserted (append-only workloads), HLL is a perfect fit. For workloads with deletions, exact reference counting is needed for correctness, and pg_trickle provides it.
-- Exact distinct counts (incrementally maintained with reference counting)
SELECT pgtrickle.create_stream_table(
'daily_unique_visitors',
$$
SELECT
date_trunc('day', visited_at) AS day,
COUNT(DISTINCT visitor_id) AS unique_visitors,
COUNT(*) AS total_pageviews
FROM pageviews
GROUP BY date_trunc('day', visited_at)
$$
);
Exponentially Weighted Moving Averages
EWMA is used extensively in monitoring (Prometheus uses it) and financial analysis. The formula is:
$$\text{EWMA}_t = \alpha \cdot x_t + (1 - \alpha) \cdot \text{EWMA}_{t-1}$$
This is inherently sequential — each value depends on the previous EWMA. It's not decomposable, but it is naturally incremental: each new value updates the EWMA in O(1) with no historical data access.
In SQL, EWMA can be expressed as a window function or, for the latest value only, as a recursive computation. Stream tables can maintain the latest EWMA per entity:
SELECT pgtrickle.create_stream_table(
  'sensor_ewma',
  $$
  SELECT
    sensor_id,
    -- Exponentially decaying weights by recency; the window function is computed
    -- in a subquery because aggregate calls cannot contain window function calls
    SUM(value * POWER(0.1, recency - 1))
      / SUM(POWER(0.1, recency - 1)) AS ewma_value
  FROM (
    SELECT
      sensor_id,
      value,
      ROW_NUMBER() OVER (PARTITION BY sensor_id ORDER BY ts DESC) AS recency
    FROM sensor_readings
  ) ranked
  GROUP BY sensor_id
  $$
);
Decision Guide
When choosing how to implement a statistical aggregate as a stream table:
| Statistic | Strategy | Accuracy | Space per group |
|---|---|---|---|
| SUM, COUNT, MIN, MAX | Direct incremental | Exact | O(1) |
| AVG, VARIANCE, STDDEV | Auxiliary sums | Exact | O(1) |
| CORR, COVAR, REGR_* | Auxiliary sums | Exact | O(1) |
| COUNT(DISTINCT) | Reference counting | Exact | O(distinct values) |
| Histogram (fixed bins) | GROUP BY bucket | Exact | O(bins) |
| PERCENTILE | Full refresh or approximation | Exact or ~2% error | O(n) or O(1) |
| MODE | Full refresh | Exact | O(distinct values) |
| EWMA | Sequential update | Exact | O(1) |
The takeaway: most statistics that data engineers use daily (mean, variance, stddev, correlation, histograms) are exactly maintainable with O(1) cost per update. The exceptions are order statistics (percentiles, median, mode) which require either full data access or approximation data structures.
pg_trickle gives you exact, live statistics for the aggregates that matter most — and is honest about the ones that need approximation. Know which is which, and design your analytics accordingly.
← Back to Blog Index | Documentation
Incremental Vector Aggregates: Building Recommendation Engines in Pure SQL
How vector_avg turns PostgreSQL into a real-time personalization engine
You have a recommendation system. Users interact with items. Each item has an embedding vector. You want to recommend new items based on what a user has liked before.
The standard approach: compute the average embedding of all items a user has liked (their "taste vector"), then find items closest to that vector in embedding space. This is collaborative filtering by cosine similarity. It works well. It's mathematically sound. And at scale, it falls apart.
Not because the math is wrong. Because keeping a million taste vectors up-to-date as users interact with items — thousands of times per second — is an infrastructure problem that nobody has a clean solution for.
Until now.
The Batch Job You're Running Today
Here's what most recommendation systems look like on the backend:
# runs nightly (or hourly if you're lucky)
import numpy as np  # used for np.mean below; `db` is a DB connection wrapper (assumed)
for user_id in get_all_active_users():
interactions = db.query("""
SELECT e.embedding
FROM user_actions ua
JOIN item_embeddings e ON e.item_id = ua.item_id
WHERE ua.user_id = %s AND ua.action = 'liked'
""", user_id)
if interactions:
taste_vec = np.mean([r.embedding for r in interactions], axis=0)
db.execute("""
INSERT INTO user_taste_vectors (user_id, taste_vec, updated_at)
VALUES (%s, %s, NOW())
ON CONFLICT (user_id) DO UPDATE
SET taste_vec = EXCLUDED.taste_vec,
updated_at = EXCLUDED.updated_at
""", user_id, taste_vec)
This is a full scan of every active user's interaction history. For a system with 500,000 active users averaging 200 interactions each, that's 100 million rows read, 500,000 vector means computed, and 500,000 upserts — every run.
At nightly cadence, recommendations are up to 24 hours stale. A user likes 10 items in a session. The recommendations don't reflect any of them until tomorrow.
At hourly cadence, you're processing 100 million rows per hour even if only 1,000 users actually did anything. The batch job doesn't know who changed — it recomputes everyone.
The smarter version tracks "users who interacted since last run" and only recomputes those. This helps, but you're still doing a full history scan per changed user. User 42 has 3,000 past interactions and just liked one new item. You read all 3,001 embeddings, compute the mean, write it back. One new interaction cost you a 3,001-row scan.
And you're maintaining all of this in application code — the change tracking, the batch scheduler, the failure handling, the monitoring. It's derived data maintenance disguised as a recommendation system.
Why AVG(vector) Is Hard in a Traditional View
If you could just write a materialized view, you would:
CREATE MATERIALIZED VIEW user_taste AS
SELECT
ua.user_id,
avg(e.embedding) AS taste_vec,
count(*) AS interaction_count
FROM user_actions ua
JOIN item_embeddings e ON e.item_id = ua.item_id
WHERE ua.action = 'liked'
GROUP BY ua.user_id;
The problem: REFRESH MATERIALIZED VIEW does a full recompute. Every time. For every user. Add the CONCURRENTLY option so readers aren't locked out, and the work roughly doubles.
This is the same scan-everything problem as the batch job, except now PostgreSQL is doing it instead of Python. It's faster per row, but it's still O(all interactions) when only O(changed users) need updating.
Standard PostgreSQL materialized views have no concept of "what changed since last time." They recompute from scratch on every refresh.
The Algebraic Insight
Here's the math that makes incremental vector averaging possible.
The mean of $n$ vectors is:
$$\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i$$
If you add one new vector $v_{n+1}$, the new mean is:
$$\bar{v}' = \frac{n \cdot \bar{v} + v_{n+1}}{n + 1}$$
If you remove vector $v_k$, the new mean is:
$$\bar{v}' = \frac{n \cdot \bar{v} - v_k}{n - 1}$$
You don't need to re-scan the history. You need the current sum (or mean + count), the delta, and one arithmetic operation. This is why AVG is called an algebraic aggregate — it can be decomposed into sub-aggregates (SUM and COUNT) that support incremental updates.
pg_trickle's DVM engine has maintained algebraic aggregates since v0.9 for scalar types: AVG(price), SUM(revenue), COUNT(*). The internal representation keeps a running (sum, count) pair per group, and each delta is applied as a simple addition or subtraction.
What v0.37.0 adds is that this same machinery now works for vector types from pgvector. Element-wise. A 1536-dimensional vector sum is 1536 independent sums maintained in parallel. The incremental cost is O(dimensions), not O(history length).
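Written out by hand, the per-group update is tiny. This is a sketch only — pg_trickle does the equivalent inside its MERGE step; the taste_state table and the 3-dimensional vectors are hypothetical, and the element-wise + operator comes from pgvector:
-- User 42 likes an item whose embedding is [0.1, 0.2, 0.3]:
-- add it to the running sum and bump the count; the mean is sum / count at read time.
UPDATE taste_state
SET sum_vec = sum_vec + '[0.1, 0.2, 0.3]'::vector,
    n       = n + 1
WHERE user_id = 42;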
vector_avg and vector_sum in Practice
User Taste Vectors
The canonical example. A user's taste is the average embedding of items they've liked:
SELECT pgtrickle.create_stream_table(
name => 'user_taste',
query => $$
SELECT
ua.user_id,
vector_avg(e.embedding) AS taste_vec,
count(*) AS interaction_count,
max(ua.created_at) AS last_interaction
FROM user_actions ua
JOIN item_embeddings e ON e.item_id = ua.item_id
WHERE ua.action = 'liked'
GROUP BY ua.user_id
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON user_taste USING hnsw (taste_vec vector_cosine_ops);
When user 42 likes a new item:
- The INSERT INTO user_actions fires pg_trickle's CDC trigger. The change is buffered in microseconds.
- Within 5 seconds, the scheduler wakes up and processes the change buffer.
- The DVM engine sees that user 42's group is affected. It looks up user 42's current running state: (sum_vector, count).
- It fetches the new item's embedding from item_embeddings (a single row lookup via the join).
- It computes: new_sum = old_sum + new_embedding, new_count = old_count + 1, new_avg = new_sum / new_count.
- It applies a MERGE to the user_taste stream table: one row updated for user 42.
- The HNSW index on taste_vec receives one update.
Total work: one row read from the change buffer, one row lookup in the join, one vector addition, one division, one row merge. Regardless of whether user 42 has 10 or 10,000 past interactions.
Category Centroids
Product categories have an "average" embedding that represents the category's semantic center. Useful for hierarchical navigation, category-to-category similarity, and cold-start recommendations.
SELECT pgtrickle.create_stream_table(
name => 'category_centroids',
query => $$
SELECT
p.category_id,
c.name AS category_name,
vector_avg(p.embedding) AS centroid,
count(*) AS product_count
FROM products p
JOIN categories c ON c.id = p.category_id
WHERE p.active = true
GROUP BY p.category_id, c.name
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
When a product is added to a category, only that category's centroid updates. When a product is reassigned from "Electronics" to "Smart Home," two centroids update — one gains a vector, one loses one. The other 500 categories are untouched.
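A typical read-side use of the centroids, in the same placeholder-parameter style as the queries later in this post ($new_product_embedding stands in for the incoming vector):
-- Route a new product to its nearest categories by centroid distance
SELECT category_id, category_name,
centroid <=> $new_product_embedding AS distance
FROM category_centroids
ORDER BY centroid <=> $new_product_embedding
LIMIT 3;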
Document Cluster Representatives
If you're building a RAG system with document clustering (for better retrieval, for deduplication, for topic modeling), you need cluster representatives:
SELECT pgtrickle.create_stream_table(
name => 'cluster_centroids',
query => $$
SELECT
dc.cluster_id,
vector_avg(d.embedding) AS centroid,
count(*) AS doc_count
FROM document_clusters dc
JOIN documents d ON d.id = dc.doc_id
GROUP BY dc.cluster_id
$$,
schedule => '15 seconds',
refresh_mode => 'DIFFERENTIAL'
);
A new document is assigned to a cluster. The cluster's centroid shifts to include it. The shift is exact — not a stale approximation from the last batch run.
vector_sum for Weighted Aggregation
Sometimes you want weighted aggregation — items viewed more recently should contribute more to the taste vector:
SELECT pgtrickle.create_stream_table(
name => 'user_weighted_taste',
query => $$
SELECT
ua.user_id,
vector_sum(e.embedding * ua.weight) AS weighted_sum,
sum(ua.weight) AS total_weight
FROM user_actions ua
JOIN item_embeddings e ON e.item_id = ua.item_id
WHERE ua.action IN ('liked', 'viewed', 'purchased')
GROUP BY ua.user_id
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- Application divides at query time:
-- taste_vec = weighted_sum / total_weight
The weight can be anything: a recency decay, an action-type multiplier (purchase = 3, like = 2, view = 1), or a learned user-specific weight. The DVM engine doesn't care — it maintains vector_sum and sum as independent algebraic aggregates.
The Numbers
Let's be concrete about what this costs vs. the batch approach.
Scenario: 1 million users, 200 interactions each on average, 1536-dimensional embeddings. 5,000 new interactions per minute across all users during peak hours.
Batch approach (hourly)
- Rows scanned per batch: depends on change tracking quality. Optimistic case: ~300,000 (60 minutes × 5,000/min). But you scan each changed user's full history, not just the new interaction. Average 200 rows per user. So if 50,000 unique users changed, you scan 50,000 × 200 = 10 million rows.
- Vector operations: 50,000 full mean computations (each averaging 200 vectors of dimension 1536).
- Writes: 50,000 upserts.
- Duration: depends on hardware, but 10M row reads + 50K vector means + 50K upserts is minutes, not seconds.
- Staleness: up to 60 minutes.
pg_trickle differential (5-second schedule)
- Changes per cycle: ~417 (5,000/min × 5s/60s).
- Rows read from change buffer: 417.
- Join lookups (item embedding): 417 single-row index lookups.
- Vector operations per cycle: 417 element-wise additions + 417 scalar increments + 417 divisions.
- Writes: ≤417 row merges (fewer if multiple interactions per user in the same cycle — they're batched by group key).
- Duration: milliseconds.
- Staleness: ≤5 seconds.
The batch job does O(changed_users × avg_history_length) work. The differential approach does O(new_interactions) work. As history grows, the batch gets slower. The differential doesn't care about history length at all.
For a user with 10,000 past interactions who likes one new item, the batch scans 10,001 rows and computes a full vector mean. pg_trickle does one addition and one division.
What About Deletions?
Deletions work the same way, in reverse.
If user 42 unlikes an item, pg_trickle's CDC trigger captures the DELETE from user_actions. The DVM engine computes:
$$\text{new_sum} = \text{old_sum} - \text{removed_embedding}$$ $$\text{new_count} = \text{old_count} - 1$$ $$\text{new_avg} = \frac{\text{new_sum}}{\text{new_count}}$$
One subtraction, one decrement, one division. The taste vector adjusts instantly.
For updates — the user changes a rating, or an item's embedding is recomputed with a new model — the DVM applies the update as a delete of the old value plus an insert of the new value. Two vector operations per affected group.
This is a fundamental property of algebraic aggregates: they support all three DML operations (INSERT, UPDATE, DELETE) with constant-time deltas per group. There's no special case for "what if we need to remove a vector from the average?" It's just subtraction.
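In the hand-written sketch from earlier (same hypothetical taste_state table), the unlike path is the mirror image:
-- User 42 unlikes the item with embedding [0.1, 0.2, 0.3]:
UPDATE taste_state
SET sum_vec = sum_vec - '[0.1, 0.2, 0.3]'::vector,
    n       = n - 1
WHERE user_id = 42;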
Combining with Nearest-Neighbor Search
The real power of maintaining taste vectors as a stream table is that you can index them with HNSW and use them for fast approximate nearest-neighbor search.
"Users like me" queries:
-- Find users with similar taste to user 42
SELECT user_id,
taste_vec <=> (SELECT taste_vec FROM user_taste WHERE user_id = 42) AS distance
FROM user_taste
WHERE user_id != 42
ORDER BY distance
LIMIT 20;
This returns the 20 users whose taste vectors are closest to user 42's (lowest cosine distance) — a user-similarity search over HNSW. The index makes it fast. The stream table makes it fresh.
"Items for me" queries:
-- Find items closest to user 42's taste
SELECT i.id, i.name,
i.embedding <=> ut.taste_vec AS distance
FROM items i, user_taste ut
WHERE ut.user_id = 42
ORDER BY i.embedding <=> ut.taste_vec
LIMIT 20;
This is standard pgvector ANN search, but the query vector is the user's live taste vector rather than a stale batch-computed one.
"Users who would like this item" queries:
-- Given a new item, find users whose taste is closest
SELECT user_id, interaction_count,
taste_vec <=> $new_item_embedding AS affinity
FROM user_taste
ORDER BY taste_vec <=> $new_item_embedding
LIMIT 1000;
This is the push-recommendation pattern: when a new item is added, find the users most likely to want it. With an HNSW index on taste_vec, this runs in milliseconds regardless of user count.
All three queries return results that reflect every interaction up to 5 seconds ago. Not up to the last batch run. Not up to yesterday.
The Consistency Guarantee That Matters
There's a subtle but important property of maintaining taste vectors inside PostgreSQL via pg_trickle rather than in an external system.
When user 42 likes an item and then immediately runs a "show me my recommendations" query, the taste vector lags the most recent committed write by at most one refresh cycle. With a 5-second schedule, that bound is 5 seconds. With IMMEDIATE mode, the staleness is zero — the taste vector updates in the same transaction as the like.
Compare this with an external pipeline: the like writes to PostgreSQL, a message is published to a queue, a worker picks it up, computes the new vector, writes it back. The user might see stale recommendations for seconds or minutes, depending on queue depth, worker concurrency, and retry logic. Under load, this gap widens. During a worker outage, it becomes infinite.
pg_trickle's guarantee is simpler: the stream table is maintained by the same database engine that stores the source data. The scheduler runs inside PostgreSQL's background worker framework. There's no message queue, no external service, no network hop. The latency bound is the refresh schedule — a configuration parameter, not an operational variable.
Beyond Taste Vectors: Other Uses for vector_avg
The taste-vector pattern is the most common, but vector_avg is a general-purpose primitive. Some other applications:
Content moderation: Maintain the average embedding of flagged content per user. When a new post's embedding is too close to the user's "flagged content centroid," auto-flag for review.
SELECT pgtrickle.create_stream_table(
name => 'user_flagged_centroid',
query => $$
SELECT user_id, vector_avg(embedding) AS flagged_centroid
FROM posts
WHERE moderation_status = 'flagged'
GROUP BY user_id
$$
);
Search quality monitoring: Track the average embedding of queries that returned zero results. When this centroid shifts significantly, your content coverage has a gap.
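A sketch of that pattern — the search_queries table and its query_embedding, result_count, and executed_at columns are assumptions for illustration:
SELECT pgtrickle.create_stream_table(
name => 'zero_result_query_centroid',
query => $$
SELECT
date_trunc('day', executed_at) AS day,
vector_avg(query_embedding) AS centroid,
count(*) AS zero_result_queries
FROM search_queries
WHERE result_count = 0
GROUP BY date_trunc('day', executed_at)
$$,
schedule => '30 seconds',
refresh_mode => 'DIFFERENTIAL'
);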
Anomaly detection: The average embedding of a time window of events. When the current window's centroid diverges from the historical average, something changed.
Cold-start mitigation: For new users with no interaction history, use the centroid of their demographic cohort or signup-intent cluster as a starting taste vector.
These all share the same property: they're running averages over groups that change incrementally. The DVM engine doesn't know or care whether the vector represents a user preference, a content category, or an anomaly signal. It maintains the algebraic aggregate. You decide what it means.
What v0.37.0 Actually Ships
Let's be specific about what's built and what's planned.
Shipping in v0.37.0:
- vector_avg(vector) — element-wise mean with running (sum_vector, count) state. Compatible with pgvector 0.7+.
- vector_sum(vector) — element-wise sum. For weighted aggregation patterns where the application divides at query time.
- Both aggregates work in DIFFERENTIAL mode with INSERT, UPDATE, and DELETE deltas.
- Both work with HNSW and IVFFlat indexes on the output column.
- Criterion benchmark baseline: microseconds per vector for the reducer, establishing a regression gate for future releases.
- pgvector added to the E2E test Docker image, with integration tests for all aggregate patterns.
Not in v0.37.0 (coming in v0.39.0):
- halfvec_avg / halfvec_sum — half-precision aggregates for storage-tiered pipelines.
- sparsevec_avg / sparsevec_sum — sparse vector aggregates for SPLADE and learned sparse models.
Not in v0.37.0 (coming in v0.38.0):
- Drift-aware HNSW reindexing (post_refresh_action => 'reindex_if_drift').
- pgtrickle.vector_status() monitoring view.
The v0.37.0 scope is intentionally focused: get the core algebraic aggregate right, ship it with tests and benchmarks, and build the operational features on top in the next release. The aggregate math has to be bulletproof before you build monitoring and automation around it.
Compared to the Alternatives
Batch recomputation (Python/Celery/Airflow): Works. Scales poorly. The compute cost grows with history length, not change volume. Staleness is bounded by your batch interval. Failure modes are operational nightmares (stale queue, silent worker death, ordering bugs).
Real-time feature stores (Feast, Tecton): These are designed for exactly this use case. They maintain running aggregates over event streams and serve them at query time. The problem: they're separate infrastructure. You need a message bus (Kafka), a compute layer (Spark/Flink), a serving layer, and a consistency model between all of them. For teams already running PostgreSQL, this is a large commitment.
Application-level incremental updates: You can do the math yourself in application code. When user 42 likes an item, read their current (sum_vec, count), add the new embedding, divide, write back. This works for simple cases. It breaks when you need atomic updates across multiple groups, when you need to handle concurrent writes correctly, or when the aggregation involves JOINs (e.g., the item embedding lives in a different table from the user action).
pg_trickle: The aggregation is defined in SQL. The incremental math is handled by the DVM engine. The consistency is guaranteed by PostgreSQL's ACID semantics. The scheduling is a configuration parameter. There's no external infrastructure. It's one extension.
The trade-off: you're committing to PostgreSQL as your compute layer. If you're already running PostgreSQL for your operational data (which you probably are), this isn't a trade-off. It's a simplification.
Getting Started
-- Prerequisites
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trickle;
-- Your existing tables
CREATE TABLE items (
id SERIAL PRIMARY KEY,
name TEXT,
embedding vector(1536)
);
CREATE TABLE user_actions (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL,
item_id INT NOT NULL REFERENCES items(id),
action TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
-- One statement: live taste vectors for all users
SELECT pgtrickle.create_stream_table(
name => 'user_taste',
query => $$
SELECT
ua.user_id,
vector_avg(i.embedding) AS taste_vec,
count(*) AS interactions
FROM user_actions ua
JOIN items i ON i.id = ua.item_id
WHERE ua.action IN ('liked', 'purchased')
GROUP BY ua.user_id
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- Index for fast ANN queries
CREATE INDEX ON user_taste USING hnsw (taste_vec vector_cosine_ops);
-- Query: items for user 42
SELECT i.id, i.name,
i.embedding <=> ut.taste_vec AS distance
FROM items i, user_taste ut
WHERE ut.user_id = 42
ORDER BY i.embedding <=> ut.taste_vec
LIMIT 10;
That's the entire recommendation backend. Two extensions, one stream table, one index, one query. The taste vectors stay fresh as users interact. The HNSW index stays current. No workers, no queues, no batch jobs.
Conclusion
The gap between "we have user interaction data and item embeddings" and "we have a working recommendation system" has always been an infrastructure gap, not an algorithmic one. The math is simple — it's averaging vectors. The hard part is maintaining those averages at scale, in real time, correctly, without building a distributed system.
vector_avg in pg_trickle v0.37.0 closes that gap. It takes the same algebraic-aggregate machinery that's been maintaining SUM, COUNT, and AVG over scalar values since v0.9, and extends it to vector types. The incremental cost is proportional to the number of new interactions, not the size of the history. The consistency is ACID. The infrastructure is one PostgreSQL extension.
Your recommendation engine doesn't need Kafka, Flink, Redis, or a feature store. It needs a running average that stays up-to-date. That's what vector_avg is.
pg_trickle is an open-source PostgreSQL extension. Source code, documentation, and installation instructions are at github.com/trickle-labs/pg-trickle. Vector aggregate support is detailed in the v0.37.0 roadmap.
← Back to Blog Index | Documentation
IVM Without Primary Keys
How content hashing lets pg_trickle track changes in keyless tables
You have a table with no primary key. Maybe it's a log table, an external feed you don't control, or a legacy table that predates your team. You want to create a stream table over it. Can you?
Yes. pg_trickle doesn't require primary keys on source tables. It uses content-based hashing to generate a synthetic row identity — __pgt_row_id — that serves the same purpose for change tracking.
This post explains how it works, what the edge cases are, and when you should add a primary key anyway.
Why Row Identity Matters
Incremental view maintenance needs to answer a question for every source row: "Is this row new, changed, or deleted?"
With a primary key, the answer is straightforward:
- If the PK exists in the change buffer but not in the previous result → INSERT.
- If the PK exists in both but values differ → UPDATE.
- If the PK was in the previous result but not in the current data → DELETE.
Without a primary key, there's no stable identifier to track across refresh cycles. Two rows with identical values are indistinguishable. A row that appears twice might be a duplicate or a re-insert.
The Content-Hash Approach
When a source table has no primary key, pg_trickle generates a row identity by hashing the row's content:
__pgt_row_id = xxHash64(col1 || col2 || col3 || ...)
The hash is computed over all columns in the row (or all columns referenced by the stream table's defining query, when column-level change tracking is active). pg_trickle uses xxHash64 — a fast, non-cryptographic hash function — because the goal is deduplication, not security.
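You can get a feel for the idea with PostgreSQL's built-in 64-bit text hash — pg_trickle uses xxHash64 internally, not hashtextextended, and the table and columns here are placeholders:
-- A content-derived 64-bit identity per row, concatenating referenced columns with a separator
SELECT hashtextextended(concat_ws('|', col1::text, col2::text, col3::text), 0) AS row_hash
FROM some_table;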
This hash is stored in the change buffer alongside the row data:
-- What a change buffer row looks like internally
SELECT __pgt_row_id, __pgt_op, old_col1, old_col2, new_col1, new_col2
FROM pgtrickle_changes.changes_12345;
__pgt_row_id | __pgt_op | old_col1 | old_col2 | new_col1 | new_col2
-------------------+----------+----------+----------+----------+----------
a1b2c3d4e5f67890 | I | NULL | NULL | 42 | 'hello'
f0e1d2c3b4a59678 | D | 17 | 'world' | NULL | NULL
How It Handles Duplicates
Content hashing means that two rows with identical values produce the same __pgt_row_id. This is by design — from the perspective of the stream table query, two identical rows are interchangeable.
But it creates a counting problem. If a table has three identical rows and one is deleted, the result should still include two copies. pg_trickle tracks this with multiplicity counting in the Z-set:
- Insert identical row → weight +1
- Delete identical row → weight -1
- Net weight = number of copies remaining
The hash identifies the value, and the weight tracks the count. This is the same multiset arithmetic described in the Z-set post, applied at the change-tracking level.
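In change-buffer terms, the surviving copy count is a signed sum over the hash. Using the illustrative changes_12345 buffer shown above (the real internal query is pg_trickle's, not this one):
SELECT __pgt_row_id,
SUM(CASE __pgt_op WHEN 'I' THEN 1 WHEN 'D' THEN -1 ELSE 0 END) AS net_weight
FROM pgtrickle_changes.changes_12345
GROUP BY __pgt_row_id;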
When Primary Keys Exist
When a source table does have a primary key, pg_trickle uses it directly:
__pgt_row_id = xxHash64(primary_key_columns)
The hash is over the PK columns only, not the entire row. This is faster (fewer bytes to hash) and more stable (the PK doesn't change on UPDATE, so the row identity is preserved even when non-key columns change).
For composite primary keys:
__pgt_row_id = xxHash64(pk_col1 || pk_col2 || pk_col3)
The practical difference: with a PK, an UPDATE is tracked as a single row with old and new values. Without a PK, an UPDATE looks like a DELETE of the old row (full content hash) plus an INSERT of the new row (different full content hash). Both are correct; the PK version is slightly more efficient because it avoids recomputing the hash.
The Hash Collision Question
xxHash64 produces a 64-bit hash. With $2^{64}$ possible values, the probability of a collision is:
$$P(\text{collision}) \approx \frac{n^2}{2^{65}}$$
For 1 billion rows ($n = 10^9$):
$$P \approx \frac{10^{18}}{3.7 \times 10^{19}} \approx 2.7\%$$
That's not zero. For tables with billions of rows, hash collisions are a real possibility.
What happens if two different rows hash to the same __pgt_row_id?
pg_trickle treats them as the same row. This can cause:
- A DELETE of one row being interpreted as a DELETE of the other.
- Change buffer deduplication merging two distinct changes.
In practice, this manifests as a minor count discrepancy in the stream table. pg_trickle's periodic FULL refresh (triggered by AUTO mode or manual intervention) corrects any accumulated drift.
If collision risk is unacceptable: Add a primary key. Even a synthetic BIGSERIAL column eliminates the problem entirely.
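For a keyless table like the events log used later in this post, that's one statement (it rewrites the table, so plan for it on large tables); afterwards, the pgtrickle.pgt_dependencies view described below is where you'd confirm the row_id_mode has switched to primary_key:
-- Synthetic surrogate key for a previously keyless table
ALTER TABLE events ADD COLUMN row_id BIGSERIAL PRIMARY KEY;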
Foreign Tables and Keyless Sources
Foreign tables — postgres_fdw, file_fdw, parquet_fdw — often lack primary keys. pg_trickle handles them with content hashing, but with a caveat:
Foreign tables can't have triggers (in most FDW implementations). pg_trickle falls back to polling-based change detection: it periodically compares the current foreign table contents with the last known state, using the content hash to identify what changed.
This is more expensive than trigger-based CDC (it requires a full scan of the foreign table), but it works. For small-to-medium foreign tables (under 1M rows), the scan is fast enough. For larger tables, consider materializing the foreign data into a local table first.
Log Tables and Append-Only Sources
Keyless tables are especially common for log-style data: event tables, audit logs, sensor readings. These are naturally append-only — rows are inserted and never updated or deleted.
For these, pg_trickle's append_only flag is the right combination with content hashing:
SELECT pgtrickle.create_stream_table(
name => 'hourly_event_counts',
query => $$
SELECT
event_type,
date_trunc('hour', created_at) AS hour,
COUNT(*) AS count
FROM events
GROUP BY event_type, date_trunc('hour', created_at)
$$,
schedule => '5s',
append_only => true
);
With append_only => true, pg_trickle skips DELETE/UPDATE tracking entirely. The content hash is still generated for deduplication, but the absence of deletes means the multiplicity counting is simplified — every hash entry is weight +1.
Checking Row Identity Mode
You can see how pg_trickle is tracking row identity for each source:
SELECT
source_table,
row_id_mode,
row_id_columns
FROM pgtrickle.pgt_dependencies
WHERE pgt_name = 'hourly_event_counts';
source_table | row_id_mode | row_id_columns
--------------+----------------+-----------------
events | content_hash | {event_type,created_at,user_id,payload}
customers | primary_key | {id}
row_id_mode is either primary_key (using the PK) or content_hash (hashing all referenced columns).
Performance Impact
Content hashing is fast — xxHash64 processes data at ~10 GB/s on modern hardware. For a row with 500 bytes of data, the hash takes ~50 nanoseconds. Even at 100,000 rows/second, hashing adds <5ms of overhead per second.
The real cost isn't the hash computation. It's the wider change buffer rows. With a primary key, the change buffer stores only the PK columns plus the changed columns. With content hashing, it stores all referenced columns (to reconstruct the hash for deduplication).
For narrow tables (5–10 columns), the difference is negligible. For wide tables (50+ columns), the change buffer can be 2–5× larger without a PK. If storage or I/O is a concern, add a primary key.
Summary
pg_trickle doesn't require primary keys. It uses xxHash64 content hashing to generate synthetic row identities, with multiplicity counting to handle duplicates correctly.
The trade-offs:
- No PK: Works everywhere. Wider change buffers. Theoretical collision risk at billion-row scale.
- With PK: More efficient. Stable identity across UPDATEs. Collision risk is negligible in practice.
For log tables, foreign tables, and legacy tables without keys, content hashing just works. For everything else, a primary key is still the better choice — not because pg_trickle requires it, but because your database does.
← Back to Blog Index | Documentation
The 45ms Cold-Start Tax and How L0 Cache Eliminates It
Why connection poolers pay a hidden penalty — and the process-local cache that fixes it
You're using PgBouncer in transaction mode. Everything is fast. Then you look at the p99 refresh latency and notice occasional 50ms spikes on stream tables that normally refresh in 5ms.
The spikes aren't random. They happen when a refresh runs on a PostgreSQL backend that hasn't seen that stream table before. The backend needs to parse the delta query template, prepare execution plans, and load metadata. This cold-start overhead is ~45ms — invisible in a dedicated-connection world, but it shows up when connection poolers recycle backends across different workloads.
pg_trickle's L0 cache (introduced in v0.36.0) eliminates this. It's a process-local, in-memory cache that stores parsed templates and metadata. When a backend is reused, the template is already there. Cold start becomes warm start.
The Cold-Start Problem
Each PostgreSQL backend is an independent process. When a backend first executes a stream table refresh, it needs to:
1. Read the catalog entry: Query pgtrickle.pgt_stream_tables for the defining query, schedule, refresh mode, and dependencies.
2. Parse the delta query template: Convert the defining query into a delta query with change buffer joins, aggregate adjustments, and MERGE logic.
3. Prepare the execution plan: PostgreSQL's planner creates a query plan for the delta query.
4. Load dependency metadata: Frontier positions, change tracking state, CDC mode for each source table.
Steps 1–4 take ~45ms on a typical workload. After the first refresh, PostgreSQL caches the prepared statement (step 3), so subsequent refreshes on the same backend skip the planning. But steps 1 and 2 — the catalog read and template parse — happen every time a new backend handles the stream table.
With a dedicated connection per backend (the traditional model), this 45ms penalty happens once: at first refresh after startup. With PgBouncer in transaction mode, it happens every time the refresh is assigned to a backend that hasn't seen it before — which can be every cycle if the pool is large enough.
The L0 Cache
The L0 cache is a per-backend, in-memory hash map that stores:
- Parsed delta query templates keyed by (pgt_id, cache_generation).
- Catalog metadata snapshots (schedule, refresh mode, dependency list).
- Pre-computed MERGE SQL for each stream table.
L0 Cache: RwLock<HashMap<(pgt_id, cache_generation), CachedTemplate>>
When a backend needs to refresh a stream table:
- Check the L0 cache for the template.
- If found and the cache_generation matches the current catalog generation → use the cached template. Skip parsing and catalog reads.
- If not found or generation mismatch → parse the template, cache it, proceed.
Cache generation is a monotonically increasing counter that bumps whenever a stream table's definition changes (ALTER, DROP, CREATE). This ensures stale templates are invalidated without explicit cache eviction.
Performance Impact
Benchmarks on a 4-core PostgreSQL 18 instance with PgBouncer (transaction mode, 20-connection pool):
| Metric | Without L0 Cache | With L0 Cache |
|---|---|---|
| First refresh on new backend | 47ms | 47ms (cache miss) |
| Subsequent refresh, same backend | 5ms | 5ms (no change) |
| Refresh after backend recycled | 47ms | 5ms (cache hit) |
| p50 latency (steady state) | 5ms | 5ms |
| p99 latency (steady state) | 48ms | 6ms |
The p99 improvement is dramatic. Without L0 cache, the p99 is dominated by cold-start events (backends seeing the stream table for the first time after recycling). With L0 cache, the parsed template survives backend recycling.
How It Survives Backend Recycling
"Wait — if PgBouncer recycles backends, doesn't the in-memory cache get destroyed?"
No. PgBouncer doesn't destroy backends. It reassigns them. The PostgreSQL process continues running; it just handles a different client connection. The L0 cache lives in the process's memory, which persists across connection reassignments.
The cache is invalidated only when:
- The backend process exits (rare in normal operation).
- The cache_generation changes (stream table was altered).
- The cache reaches template_cache_max_entries and evicts the least-recently-used entry.
- The cache entry exceeds template_cache_max_age_hours and is considered stale.
Configuration
The L0 cache is enabled by default since v0.36.0:
-- Cache size (entries)
SHOW pg_trickle.template_cache_max_entries;
-- 128
-- Maximum age before eviction
SHOW pg_trickle.template_cache_max_age_hours;
-- 24
Sizing: Each cached template uses ~1–5KB of memory, depending on query complexity. With max_entries = 128, the cache uses at most ~640KB per backend. For a 100-connection pool, that's ~64MB total — negligible on any modern server.
For large deployments (500+ stream tables): Increase max_entries so entries aren't evicted. Because the cache is per-backend, set it globally (postgresql.conf or ALTER SYSTEM) rather than per-session:
ALTER SYSTEM SET pg_trickle.template_cache_max_entries = 1024;
SELECT pg_reload_conf();
Monitoring Cache Effectiveness
SELECT * FROM pgtrickle.cache_stats();
backend_pid | entries | hits | misses | hit_rate | oldest_entry_age
-------------+---------+---------+--------+----------+-------------------
12345 | 15 | 4,521 | 17 | 99.6% | 2h 14m
12346 | 23 | 8,302 | 25 | 99.7% | 4h 01m
12347 | 8 | 1,203 | 9 | 99.3% | 0h 45m
Target: hit_rate > 99%. If it's lower, the cache is too small (entries being evicted) or stream tables are being altered frequently (generation bumps invalidate entries).
Diagnosis:
- hit_rate < 95% with high misses → increase max_entries.
- hit_rate < 95% with low entries → stream tables are being altered frequently. This is expected during development; in production it should stabilize.
Without PgBouncer: Does L0 Cache Still Help?
Yes, but less dramatically.
Without a connection pooler, each backend is dedicated to one connection. The cold-start happens once (at first refresh) and never again. The L0 cache still helps at the margins — for example, keeping templates warm across pg_trickle.enabled toggles or after an ALTER EXTENSION UPDATE.
The big win is with connection poolers: PgBouncer, pgcat, Supavisor, odyssey. Any pooler that reassigns backends between clients will see the p99 improvement.
The Broader Picture: Caching Layers
pg_trickle has multiple caching layers:
| Layer | Scope | What's Cached | Lifetime |
|---|---|---|---|
| L0 | Per-backend (process-local) | Parsed templates, catalog metadata | Until backend exit, generation bump, or LRU eviction |
| PostgreSQL plan cache | Per-backend | Query execution plans | Until backend exit or invalidation |
| Shared buffers | Shared across backends | Table/index pages | LRU eviction |
| OS page cache | System-wide | Disk blocks | LRU eviction |
L0 sits above the PostgreSQL plan cache. Even if the plan cache has the execution plan, the template parsing (steps 1–2) is L0's domain. Both layers contribute to refresh performance; L0 handles the pg_trickle-specific overhead that the plan cache doesn't cover.
Summary
The L0 process-local template cache eliminates the ~45ms cold-start penalty that connection-pooler workloads pay when a backend handles a stream table for the first time.
It's a RwLock<HashMap> keyed by (pgt_id, cache_generation). Hits are <1ms. Misses pay the full parse cost once and cache the result. Generation tracking ensures stale entries are invalidated without explicit eviction.
If you're running PgBouncer in transaction mode, the L0 cache is the difference between a 5ms p99 and a 48ms p99. It's enabled by default. Check cache_stats() to verify it's working.
← Back to Blog Index | Documentation
LATERAL Joins in a Stream Table
Row-scoped re-execution for the most powerful join type in PostgreSQL
LATERAL is the most powerful and least understood join in PostgreSQL. It lets the right side of a join reference columns from the left side — turning each left row into a separate subquery context. JSON_TABLE, unnest(), generate_series(), and correlated subqueries all use LATERAL under the hood.
Making LATERAL incremental is fundamentally different from making a regular join incremental. In a regular join, you can pre-filter the delta: changed rows on one side are joined with all rows on the other side. In a LATERAL join, the right side is parameterized by the left side. There's no global right-side table to join against.
pg_trickle handles LATERAL with row-scoped re-execution: for each changed left-side row, re-run the LATERAL subquery for that row.
How Regular Join Deltas Work
Recall the delta rule for a regular join:
Δ(A ⋈ B) = ΔA ⋈ B ∪ A ⋈ ΔB
Changed rows on the left side are joined with the full right side. Changed rows on the right side are joined with the full left side. This works because both sides are independent tables.
How LATERAL Deltas Work
A LATERAL join doesn't have an independent right side. The right side is a function of each left-side row:
SELECT o.order_id, o.items_json, i.*
FROM orders o,
LATERAL json_to_recordset(o.items_json)
AS i(product_id INT, quantity INT, price NUMERIC);
Here, the LATERAL subquery (json_to_recordset) takes o.items_json as input. There is no "right-side table" — the right side is computed per row.
The delta rule becomes:
Left side changes (Δorders): For each changed order, execute the LATERAL subquery for that order. The result is the delta for those rows.
Right side can't change independently. The right side is derived from the left side. If the JSON column changes, that's a left-side change.
This simplification is the key insight: LATERAL deltas only need to consider left-side changes.
Row-Scoped Re-Execution
When a row changes on the left side of a LATERAL join, pg_trickle:
1. For deleted rows: Remove all result rows that were produced by the old left-side row. (The previous LATERAL expansion is no longer valid.)
2. For inserted rows: Execute the LATERAL subquery for the new row. Insert the results.
3. For updated rows: Remove old results (step 1), then insert new results (step 2).
This is "re-execution" because the LATERAL subquery is literally re-run for each affected left-side row. It's "row-scoped" because only the affected rows are processed.
Practical Examples
JSON Document Unpacking
SELECT pgtrickle.create_stream_table(
name => 'order_line_items',
query => $$
SELECT
o.order_id,
o.customer_id,
i.product_id,
i.quantity,
i.price,
i.quantity * i.price AS line_total
FROM orders o,
LATERAL jsonb_to_recordset(o.line_items)
AS i(product_id INT, quantity INT, price NUMERIC)
$$,
schedule => '5s'
);
When a new order is inserted, pg_trickle unpacks its line_items JSON and inserts the resulting rows. When an order is updated (items added or removed), the old line items are deleted and the new ones are inserted.
Unnesting Arrays
SELECT pgtrickle.create_stream_table(
name => 'user_tags_flat',
query => $$
SELECT u.user_id, u.name, t.tag
FROM users u,
LATERAL unnest(u.tags) AS t(tag)
$$,
schedule => '10s'
);
Each user has an array of tags. The LATERAL unnest() flattens them. When a user's tag array changes, only that user's rows are re-expanded.
Set-Returning Functions
SELECT pgtrickle.create_stream_table(
name => 'date_ranges',
query => $$
SELECT
e.event_id,
e.title,
d.day::date AS event_day
FROM events e,
LATERAL generate_series(e.start_date, e.end_date, '1 day') AS d(day)
$$,
schedule => '30s'
);
Each event spans a date range. The LATERAL generate_series() produces one row per day. When an event's dates change, only that event's rows are recalculated.
JSON_TABLE (PostgreSQL 17+)
SELECT pgtrickle.create_stream_table(
name => 'api_responses_parsed',
query => $$
SELECT
r.request_id,
r.endpoint,
j.*
FROM api_responses r,
LATERAL JSON_TABLE(
r.response_body,
'$.results[*]'
COLUMNS (
item_id INT PATH '$.id',
status TEXT PATH '$.status',
score NUMERIC PATH '$.score'
)
) AS j
$$,
schedule => '5s'
);
JSON_TABLE is syntactic sugar for a LATERAL join over JSON path expressions. pg_trickle handles it with the same row-scoped re-execution strategy.
Performance Characteristics
The cost of a LATERAL delta is:
cost = |changed left rows| × avg_cost(LATERAL subquery per row)
This is efficient when:
- Few left-side rows change per cycle. 5 changed orders → 5 LATERAL re-executions.
- The LATERAL subquery is fast per row. unnest() and json_to_recordset() are microsecond-level.
It's expensive when:
- Many left-side rows change. 10,000 changed orders → 10,000 re-executions. At this point, FULL refresh may be faster.
- The LATERAL subquery is expensive. If the subquery hits a large table or calls a slow function, the per-row cost multiplies quickly.
| Scenario | Left changes | Per-row LATERAL cost | Total delta cost |
|---|---|---|---|
| JSON unpacking, 5 new orders | 5 | 0.1ms | 0.5ms |
| Array unnest, 50 user updates | 50 | 0.05ms | 2.5ms |
| generate_series, 10 events | 10 | 0.2ms | 2ms |
| Expensive function, 1000 changes | 1000 | 5ms | 5,000ms |
For the last case, pg_trickle's AUTO mode will detect the high cost and switch to FULL refresh.
LATERAL with Aggregation
A common pattern: LATERAL to expand, then GROUP BY to aggregate.
SELECT pgtrickle.create_stream_table(
name => 'order_item_stats',
query => $$
SELECT
o.customer_id,
COUNT(i.product_id) AS total_items,
SUM(i.quantity * i.price) AS total_value
FROM orders o,
LATERAL jsonb_to_recordset(o.line_items)
AS i(product_id INT, quantity INT, price NUMERIC)
GROUP BY o.customer_id
$$,
schedule => '5s'
);
pg_trickle processes this as a two-stage pipeline:
- LATERAL stage: Row-scoped re-execution for changed orders. Produces the expanded line items.
- Aggregate stage: Algebraic delta on the aggregates. The expanded items feed into the SUM/COUNT delta rules.
Only the groups (customers) whose orders changed are updated. The algebraic delta on the aggregate side is O(affected groups), not O(all rows).
Limitations
Volatile functions in LATERAL: If the LATERAL subquery calls a VOLATILE function (e.g., random(), clock_timestamp()), the result is non-deterministic. pg_trickle rejects VOLATILE functions in DIFFERENTIAL and IMMEDIATE mode queries:
ERROR: VOLATILE function random() cannot be used in DIFFERENTIAL mode
HINT: Use FULL refresh mode or replace with a STABLE/IMMUTABLE alternative
Large outer tables with small LATERAL results: If the outer table has 10 million rows and the LATERAL produces 1 row per outer row (a correlated scalar subquery in disguise), you might be better off rewriting as a regular join.
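A sketch of that rewrite, with hypothetical orders/prices tables — the derived table on the right becomes an independent side that the regular join delta rule can pre-filter:
-- Before: a per-row LATERAL lookup, re-executed for every changed order
SELECT o.id, latest.price AS latest_price
FROM orders o,
LATERAL (
SELECT price FROM prices p
WHERE p.product_id = o.product_id
ORDER BY p.valid_from DESC
LIMIT 1
) latest;

-- After: a regular join against a DISTINCT ON derived table
SELECT o.id, lp.price AS latest_price
FROM orders o
JOIN (
SELECT DISTINCT ON (product_id) product_id, price, valid_from
FROM prices
ORDER BY product_id, valid_from DESC
) lp ON lp.product_id = o.product_id;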
LATERAL referencing multiple outer tables: In multi-way joins, LATERAL can reference columns from any previously joined table. pg_trickle supports this, but the re-execution scope is determined by the outermost changed table. If the outermost table changes, all downstream LATERAL re-executions trigger.
Summary
LATERAL joins are maintained incrementally via row-scoped re-execution. When a left-side row changes, the LATERAL subquery is re-run for that row. The cost is proportional to the number of changed left-side rows times the per-row LATERAL cost.
This covers JSON_TABLE, unnest(), generate_series(), jsonb_to_recordset(), and any other set-returning function used with LATERAL.
The performance sweet spot: few changed rows, cheap LATERAL subquery. The failure mode: many changed rows or expensive subquery. AUTO mode handles the transition gracefully.
← Back to Blog Index | Documentation
Publishing Stream Tables via Logical Replication
Your stream table as a replication origin — ship derived data to downstream PostgreSQL instances
You have a stream table that aggregates orders into regional revenue summaries. The primary database is in us-east-1. Your analytics team in Europe needs the same data. Your reporting service runs against a separate read replica.
Normally, you'd either replicate the raw orders table and re-aggregate on each downstream instance, or build a CDC pipeline with Debezium to capture the stream table changes.
There's a simpler option: publish the stream table itself via PostgreSQL's built-in logical replication.
A stream table is a regular PostgreSQL table with triggers and a scheduler. It participates in logical replication like any other table. You can create a publication, and downstream subscribers receive the inserts, updates, and deletes that each refresh applies.
Setting It Up
On the Primary (Publisher)
-- Create the stream table (if not already existing)
SELECT pgtrickle.create_stream_table(
name => 'revenue_by_region',
query => $$
SELECT region, date_trunc('day', created_at) AS day,
SUM(total) AS revenue, COUNT(*) AS order_count
FROM orders
JOIN customers ON customers.id = orders.customer_id
GROUP BY region, date_trunc('day', created_at)
$$,
schedule => '5s'
);
-- Create a publication for the stream table
CREATE PUBLICATION revenue_pub FOR TABLE pgtrickle.revenue_by_region;
On the Subscriber
-- Create the target table. Logical replication matches tables by schema-qualified
-- name, so the subscriber copy must live in the same schema as the publisher's.
CREATE SCHEMA IF NOT EXISTS pgtrickle;
CREATE TABLE pgtrickle.revenue_by_region (
region TEXT,
day TIMESTAMP,
revenue NUMERIC,
order_count BIGINT
);
-- Create the subscription
CREATE SUBSCRIPTION revenue_sub
CONNECTION 'host=primary-db dbname=prod user=replicator'
PUBLICATION revenue_pub;
That's it. The subscriber receives every change that the stream table refresh applies. When the MERGE step inserts 3 new region-day groups and updates 5 existing ones, the subscriber receives 3 INSERT and 5 UPDATE messages.
What Gets Replicated
Logical replication captures DML operations on the published table. For a stream table, the DML happens during the MERGE step of each refresh:
| Refresh Operation | Replicated As |
|---|---|
| New group appears in result | INSERT |
| Existing group changes (new sum, count) | UPDATE |
| Group disappears from result | DELETE |
| FULL refresh (truncate + reload) | TRUNCATE + INSERTs |
For DIFFERENTIAL refreshes, the subscriber sees fine-grained changes — only the groups that actually changed. For FULL refreshes, the subscriber sees a TRUNCATE followed by a full reload. Both are correct; DIFFERENTIAL produces smaller replication streams.
Why This Is Useful
Offloading Analytics Queries
The primary database handles transactional workloads. Analytical queries on the stream table compete for the same resources. By publishing the stream table to a downstream analytics instance, you separate the workloads:
primary: orders (writes) → stream table refresh (pg_trickle)
↓ logical replication
analytics: revenue_by_region (reads) → Grafana, reports
The analytics instance doesn't need pg_trickle installed. It's just a regular PostgreSQL database receiving replicated data.
Multi-Region Data Distribution
For globally distributed applications, replicate the stream table to each regional database:
us-east-1 (primary): compute revenue_by_region
↓ logical replication
eu-west-1: revenue_by_region (read-only copy)
ap-southeast-1: revenue_by_region (read-only copy)
Each region gets sub-second updates without running its own aggregation pipeline. The primary computes once; subscribers receive the result.
Feeding Non-PostgreSQL Systems
Logical replication can be consumed by tools like Debezium, which decode the replication stream and forward it to Kafka, Elasticsearch, or other systems. Publishing a stream table means Debezium captures the aggregated, processed data — not the raw source tables.
orders → pg_trickle → revenue_by_region → logical replication → Debezium → Kafka
The Kafka consumers receive clean, aggregated events. No need to re-aggregate downstream.
CDC Implications
When a stream table is published via logical replication, there are two layers of change capture:
- CDC on source tables (pg_trickle's trigger/WAL capture): feeds the stream table refresh.
- Logical replication on the stream table: ships refresh results downstream.
These are independent. The first is managed by pg_trickle. The second is standard PostgreSQL logical replication.
If the stream table uses WAL-based CDC (cdc_mode => 'wal'), and you also publish it via logical replication, both consume from the WAL — but they use separate replication slots. The slots are independent; one falling behind doesn't affect the other.
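You can watch both slots side by side with the standard catalog view — this is core PostgreSQL, not pg_trickle, and slot names will vary by setup:
SELECT slot_name, plugin, active,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;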
Replication Identity
Logical replication requires a replication identity to identify rows for UPDATE and DELETE. By default, PostgreSQL uses the primary key.
Stream tables may not have a primary key in the traditional sense. pg_trickle's internal row identity (__pgt_row_id) is a hidden column. For logical replication, you need an explicit identity:
Option 1: Add a primary key on the stream table (if the output has a natural key):
-- After creating the stream table
ALTER TABLE pgtrickle.revenue_by_region
ADD PRIMARY KEY (region, day);
Option 2: Use REPLICA IDENTITY FULL:
ALTER TABLE pgtrickle.revenue_by_region
REPLICA IDENTITY FULL;
This tells PostgreSQL to include all column values in UPDATE/DELETE WAL records. It's less efficient (larger WAL records) but works without a primary key.
Recommendation: If the stream table has a natural key (which most GROUP BY stream tables do — the group-by columns), add a primary key. It's better for replication performance and makes the subscriber's conflict resolution easier.
Multiple Stream Tables in One Publication
You can publish multiple stream tables in a single publication:
CREATE PUBLICATION analytics_pub FOR TABLE
pgtrickle.revenue_by_region,
pgtrickle.customer_metrics,
pgtrickle.product_rankings;
The subscriber creates matching tables and receives changes from all three. Each stream table's refresh produces independent changes; the subscriber applies them in commit order.
Latency
The end-to-end latency from source change to subscriber visibility:
source DML → CDC capture (~0-15ms) → scheduler dispatch (~0-1000ms) →
refresh execution (~5-500ms) → WAL write → logical replication (~1-50ms) →
subscriber apply (~1-10ms)
Total: typically 100ms–2s. The dominant factor is the scheduler interval (how often the stream table is refreshed). With a 1-second schedule and DIFFERENTIAL mode, expect ~1–2 seconds end-to-end.
For IMMEDIATE mode stream tables, the refresh happens in the same transaction as the source DML. The subscriber sees the change as soon as the transaction commits and the WAL is shipped. End-to-end latency: typically 5–50ms.
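The replication leg of that budget is observable with PostgreSQL's standard statistics views (again, core PostgreSQL, not pg_trickle):
-- On the publisher: per-subscriber apply lag
SELECT application_name, state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the subscriber: when each subscription worker last received a message
SELECT subname, received_lsn, last_msg_receipt_time
FROM pg_stat_subscription;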
Conflict Handling on the Subscriber
If the subscriber table is read-only (no local writes), there are no conflicts. This is the recommended setup.
If the subscriber has local writes (not recommended for replicated stream tables), standard PostgreSQL logical replication conflict handling applies: duplicates cause the subscription to stall until resolved.
Summary
Stream tables are regular PostgreSQL tables. They participate in logical replication like any other table. Create a publication, set up a subscriber, and the stream table's refresh deltas ship downstream.
Use this for:
- Offloading analytics queries to a separate instance
- Multi-region distribution of aggregated data
- Feeding Debezium/Kafka with clean, aggregated events
Set a replication identity (preferably a primary key on the group-by columns) and you're done. The primary computes, subscribers receive, no custom CDC pipeline required.
← Back to Blog Index | Documentation
The Medallion Architecture Lives Inside PostgreSQL
Bronze, Silver, Gold — without Spark, without Airflow, without leaving your database
The medallion architecture is a data engineering pattern. Raw data lands in a Bronze layer. It gets cleaned and deduplicated into Silver. Business-level aggregates live in Gold. Dashboards and applications read from Gold.
The pattern came from the Spark/Databricks world. Most implementations involve Spark jobs, Delta Lake tables, an Airflow DAG to orchestrate the pipeline, and a scheduler to run it all. The Bronze-to-Silver-to-Gold pipeline typically runs on a schedule — hourly, maybe every 15 minutes if you're aggressive.
pg_trickle implements the same architecture entirely inside PostgreSQL. No Spark. No Airflow. No external scheduler. Propagation time from Bronze to Gold: under 5 seconds.
The Setup
Here's a concrete example: an e-commerce platform tracking orders, with fraud detection rules and executive-level KPIs.
Bronze: Raw Ingest
Bronze is just your regular PostgreSQL tables. This is where your application writes.
CREATE TABLE orders (
id bigserial PRIMARY KEY,
customer_id bigint NOT NULL,
amount numeric(12,2) NOT NULL,
currency text NOT NULL DEFAULT 'USD',
status text NOT NULL DEFAULT 'pending',
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE customers (
id bigint PRIMARY KEY,
name text NOT NULL,
region text NOT NULL,
tier text NOT NULL DEFAULT 'standard',
email text
);
Nothing special here. No CDC pipelines, no event sourcing. Just tables.
Silver: Cleaned, Enriched, Joined
The Silver layer is a stream table that joins, cleans, and enriches the raw data:
SELECT pgtrickle.create_stream_table(
'silver_orders',
$$SELECT
o.id AS order_id,
o.customer_id,
c.name AS customer_name,
c.region,
c.tier AS customer_tier,
o.amount,
o.currency,
o.status,
o.created_at,
CASE
WHEN o.amount > 10000 AND c.tier = 'standard' THEN true
ELSE false
END AS flagged_for_review
FROM orders o
JOIN customers c ON c.id = o.customer_id$$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL'
);
silver_orders is a real table. You can index it, query it with arbitrary WHERE clauses, put a GIN index on it for full-text search if you want. It updates within 2 seconds of any change to orders or customers.
Notice the flagged_for_review column — Silver isn't just a raw copy, it's enriched with business logic at the SQL level.
Gold: Business Aggregates
The Gold layer builds on Silver. It aggregates into the shape your dashboards and APIs actually need:
-- Regional revenue KPIs
SELECT pgtrickle.create_stream_table(
'gold_revenue_by_region',
$$SELECT
region,
date_trunc('day', created_at) AS day,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
AVG(amount) AS avg_order_value,
COUNT(*) FILTER (WHERE flagged_for_review) AS flagged_count
FROM silver_orders
WHERE status != 'cancelled'
GROUP BY region, date_trunc('day', created_at)$$,
schedule => '3s',
refresh_mode => 'DIFFERENTIAL'
);
-- Customer lifetime value
SELECT pgtrickle.create_stream_table(
'gold_customer_ltv',
$$SELECT
customer_id,
customer_name,
region,
customer_tier,
SUM(amount) AS lifetime_value,
COUNT(*) AS total_orders,
MAX(created_at) AS last_order_at
FROM silver_orders
WHERE status != 'cancelled'
GROUP BY customer_id, customer_name, region, customer_tier$$,
schedule => '3s',
refresh_mode => 'DIFFERENTIAL'
);
How the DAG Works
pg_trickle knows that gold_revenue_by_region depends on silver_orders, which depends on orders and customers. This forms a directed acyclic graph:
orders ──┐
├──→ silver_orders ──┬──→ gold_revenue_by_region
customers ┘ └──→ gold_customer_ltv
The scheduler respects this ordering automatically. It won't refresh gold_revenue_by_region until silver_orders is up to date. You don't need to coordinate schedules — set the schedule on each layer and pg_trickle handles the propagation.
You can inspect the DAG:
SELECT * FROM pgtrickle.dependency_tree('gold_revenue_by_region');
This returns the full dependency chain, including depth level and refresh ordering.
Why This Is Better Than the Spark Version
Latency
The Spark medallion pipeline runs on a schedule — typically every 15 minutes to an hour. Between runs, your Gold layer is stale. pg_trickle's pipeline runs continuously. The end-to-end propagation time from an INSERT into orders to an updated row in gold_revenue_by_region is the sum of the schedules: 2 seconds (Silver) + 3 seconds (Gold) = 5 seconds worst case.
Infrastructure
The Spark version requires:
- A Spark cluster (or Databricks workspace)
- Delta Lake or Iceberg for Bronze/Silver/Gold storage
- An Airflow/Dagster/Prefect DAG for orchestration
- A scheduler
- Monitoring for each step
- Permissions and credentials across systems
The pg_trickle version requires:
- PostgreSQL with the pg_trickle extension installed
That's it. The scheduler, CDC, DAG resolution, monitoring, and storage are all inside PostgreSQL.
Consistency
In the Spark version, each layer is independently computed. If the Silver job fails halfway through, Gold may have partial data. Airflow retries help, but the state management is external.
In pg_trickle, each refresh cycle is a PostgreSQL transaction. It either fully commits or fully rolls back. There's no partial state. If a refresh fails, the stream table retains its previous consistent state and the scheduler retries on the next cycle.
Cost
A dedicated Spark cluster costs real money — even on spot instances, the compute budget for a medallion pipeline is significant for small-to-medium teams. pg_trickle runs on your existing PostgreSQL instance. The marginal cost is CPU time during refresh cycles, which for most workloads is negligible.
Adding More Layers
The DAG isn't limited to three levels. You can chain as many stream tables as your use case needs:
-- A "platinum" layer: top-10 customers per region, for the exec dashboard
SELECT pgtrickle.create_stream_table(
'platinum_top_customers_by_region',
$$SELECT region, customer_id, customer_name, lifetime_value
FROM (
SELECT region, customer_id, customer_name, lifetime_value,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY lifetime_value DESC) AS region_rank
FROM gold_customer_ltv
) ranked
WHERE region_rank <= 10$$,
schedule => '5s',
refresh_mode => 'DIFFERENTIAL'
);
pg_trickle supports arbitrary DAG depth. Each additional layer adds its schedule interval to the end-to-end propagation time, but the refresh cost per layer is proportional only to the delta at that layer — not the total data volume.
Monitoring the Pipeline
Since every stream table has its own refresh stats, you can monitor the pipeline the same way you'd monitor any pg_trickle stream table:
-- Check freshness across all layers
SELECT
st.pgt_name,
st.refresh_mode,
st.schedule,
now() - st.data_timestamp AS staleness,
st.consecutive_errors
FROM pgtrickle.pgt_status() st
WHERE st.pgt_name LIKE 'silver_%' OR st.pgt_name LIKE 'gold_%' OR st.pgt_name LIKE 'platinum_%'
ORDER BY st.pgt_name;
If you enable pg_trickle's self-monitoring (which itself uses stream tables), the extension watches its own refresh latency and alerts when a layer falls behind.
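For a hand-rolled check, you can compare each layer's staleness against its own schedule; a minimal sketch, assuming the schedule values returned by pgtrickle.pgt_status() cast cleanly to interval:
-- Layers whose staleness exceeds 2× their schedule
SELECT st.pgt_name,
       st.schedule,
       now() - st.data_timestamp AS staleness
FROM pgtrickle.pgt_status() st
WHERE now() - st.data_timestamp > 2 * st.schedule::interval
ORDER BY staleness DESC;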
When to Still Use Spark
pg_trickle's medallion architecture works inside a single PostgreSQL instance (or a Citus cluster). If your Bronze layer is petabytes of Parquet files in S3, Spark is the right tool. If your Bronze layer is a PostgreSQL table that your application writes to, the Spark pipeline was always overkill.
The dividing line is roughly: if your data fits in PostgreSQL, your medallion architecture should too.
← Back to Blog Index | Documentation
Migrating from pg_ivm to pg_trickle
Feature gap, SQL differences, and a step-by-step migration procedure
If you're using pg_ivm for incremental view maintenance in PostgreSQL, you probably chose it because it was the only option. It's been around since 2021, it works for simple cases, and it proved that IVM inside PostgreSQL is viable.
But if you've hit its limits — no background scheduling, no multi-table JOIN support in some cases, no monitoring, no DAG resolution — you've probably wondered whether there's an upgrade path.
This post is that upgrade path.
The Feature Gap
Here's what pg_ivm and pg_trickle each support as of mid-2026:
| Feature | pg_ivm | pg_trickle |
|---|---|---|
| Single-table aggregation (SUM, COUNT, AVG) | ✅ | ✅ |
| Multi-table JOINs with aggregation | Partial | ✅ |
| LEFT/RIGHT/FULL OUTER JOINs | ❌ | ✅ |
| HAVING clauses | ❌ | ✅ |
| Window functions (auto-rewrite) | ❌ | ✅ (via auto-rewrite to DIFFERENTIAL) |
| Subqueries in WHERE | ❌ | ✅ |
| CASE expressions in aggregates | Limited | ✅ |
| FILTER clauses on aggregates | ❌ | ✅ |
| Background scheduler | ❌ (manual refresh) | ✅ (configurable schedule, SLA tiers) |
| IMMEDIATE refresh mode | ✅ | ✅ |
| DIFFERENTIAL (deferred) refresh | ❌ | ✅ |
| DAG-aware scheduling | ❌ | ✅ (diamond-safe) |
| Change data capture | Trigger-based | Hybrid (triggers or WAL) |
| Monitoring / health checks | ❌ | ✅ (Prometheus, self-monitoring) |
| Online schema evolution | ❌ | ✅ (alter_stream_table) |
| Transactional outbox/inbox | ❌ | ✅ |
| Row-level security | ❌ | ✅ |
| dbt integration | ❌ | ✅ |
| Citus support | ❌ | ✅ |
| Partitioned source tables | ❌ | ✅ |
| Downstream publications | ❌ | ✅ |
The core difference: pg_ivm implements IMMEDIATE-mode refresh only. Every source table change triggers an inline delta computation. There's no deferred mode, no background worker, and no scheduling.
This means pg_ivm adds overhead to every write, with no way to amortize it. For low-write workloads, that's fine. For high-write workloads, it's a bottleneck.
SQL Differences
Creating an IVM table
pg_ivm:
SELECT create_immv('order_totals',
'SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id');
pg_trickle:
SELECT pgtrickle.create_stream_table(
'order_totals',
$$SELECT customer_id, SUM(amount) AS total
FROM orders GROUP BY customer_id$$,
schedule => '3s',
refresh_mode => 'DIFFERENTIAL'
);
The pg_trickle version is more verbose — you specify the schedule and refresh mode. In exchange, you get control over how and when the view is refreshed.
Refreshing
pg_ivm: Automatic — triggers fire on every source change and update the IMMV inline.
pg_trickle (IMMEDIATE): Same behavior as pg_ivm — delta applied in the source transaction.
pg_trickle (DIFFERENTIAL): Background worker drains change buffers on the configured schedule. No write-path overhead.
Dropping
pg_ivm:
DROP TABLE order_totals;
-- Manually clean up triggers
pg_trickle:
SELECT pgtrickle.drop_stream_table('order_totals');
-- Cleans up storage table, triggers, change buffers, catalog entries
Migration Procedure
Step 1: Inventory your IMMVs
-- Find all pg_ivm maintained views and their defining queries
-- (pg_ivm stores these in its pg_ivm_immv catalog; schema and column names may vary by version)
SELECT immvrelid::regclass AS immv_name, viewdef
FROM pg_ivm_immv;
Record each IMMV name and its defining query.
Step 2: Create equivalent stream tables
For each IMMV, create a pg_trickle stream table. Start with IMMEDIATE mode to match pg_ivm's behavior:
SELECT pgtrickle.create_stream_table(
'order_totals',
$$SELECT customer_id, SUM(amount) AS total
FROM orders GROUP BY customer_id$$,
refresh_mode => 'IMMEDIATE'
);
If the IMMV query uses features pg_trickle supports but pg_ivm didn't (LEFT JOINs, HAVING, window functions), you can enhance the query during migration.
Step 3: Validate
Compare the pg_ivm IMMV with the pg_trickle stream table:
-- Should return zero rows if both are correct
(SELECT * FROM old_order_totals_immv EXCEPT SELECT * FROM order_totals)
UNION ALL
(SELECT * FROM order_totals EXCEPT SELECT * FROM old_order_totals_immv);
Step 4: Update application queries
If your application queries the IMMV by name, update the references to point to the stream table. If you used the same name, no changes needed.
Step 5: Drop the old IMMVs
-- Remove pg_ivm's IMMV (a regular table); clean up its triggers as in the Dropping section above
DROP TABLE old_order_totals_immv;
Step 6: Optimize refresh modes
Now that you're on pg_trickle, evaluate which stream tables should be DIFFERENTIAL:
-- Switch high-write tables to deferred refresh
SELECT pgtrickle.alter_stream_table('order_totals',
refresh_mode => 'DIFFERENTIAL',
schedule => '2s'
);
This removes the write-path overhead. Your write throughput improves, and the stream table is at most 2 seconds stale.
Step 7: Remove pg_ivm
DROP EXTENSION pg_ivm;
Gotchas
Different NULL handling
pg_ivm and pg_trickle may handle NULL aggregation groups differently. Test your queries with NULL values in the GROUP BY columns to ensure the results match.
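A quick way to exercise that case, assuming a test environment where orders.customer_id is nullable (adapt names to your schema):
-- Create a NULL group, then compare it in both views
INSERT INTO orders (customer_id, amount) VALUES (NULL, 25.00);
SELECT * FROM old_order_totals_immv WHERE customer_id IS NULL;
SELECT * FROM order_totals WHERE customer_id IS NULL;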
Trigger ordering
If you have other triggers on the source tables, be aware that pg_trickle's CDC triggers execute as AFTER triggers. If your existing triggers modify the row (BEFORE triggers), the change captured by pg_trickle will reflect the modified value, which is correct. But if you have AFTER triggers that perform additional writes, check that the ordering doesn't cause issues.
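To see which triggers exist on a source table and their timing (PostgreSQL fires same-event, same-timing triggers in alphabetical name order):
-- Inspect triggers on a source table
SELECT trigger_name, action_timing, event_manipulation
FROM information_schema.triggers
WHERE event_object_table = 'orders'
ORDER BY action_timing, trigger_name;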
Performance profile change
Moving from IMMEDIATE to DIFFERENTIAL changes the performance profile. IMMEDIATE mode has higher write latency but zero read staleness. DIFFERENTIAL mode has lower write latency but some read staleness. Profile both modes with your actual workload.
Why Switch?
If pg_ivm works for your use case today and you have no plans to:
- Scale write throughput
- Add monitoring
- Chain views (DAG)
- Use background scheduling
- Enable outbox/inbox
- Deploy on Citus
Then staying on pg_ivm is fine. It's simpler and has fewer moving parts.
But if you're hitting any of those needs — or if you're finding pg_ivm's query restrictions frustrating — the migration is straightforward. You can run both extensions simultaneously during the transition, validate correctness, and cut over at your own pace.
← Back to Blog Index | Documentation
One PostgreSQL, Five Databases, One Worker Pool
Multi-database pg_trickle: per-database isolation with shared worker scheduling
You're running a SaaS product. Each customer has their own PostgreSQL database on a shared server. You've installed pg_trickle in all of them. Now you have 5 databases, each with its own stream tables, its own scheduler, and its own demand for refresh workers.
How does pg_trickle coordinate across databases without them stepping on each other?
The answer is a two-level architecture: one launcher per server, one scheduler per database, one shared worker pool.
The Launcher
When PostgreSQL starts, pg_trickle registers a single background worker called the launcher. The launcher's job is discovery:
- Connect to each database that has pg_trickle installed.
- Start a scheduler process for each database.
- Monitor the schedulers — restart them if they crash.
The launcher uses pg_database to enumerate databases and checks for the pgtrickle schema. Databases without pg_trickle are ignored.
PostgreSQL server
├── pg_trickle launcher (1 per server)
│ ├── scheduler: customer_a_db
│ ├── scheduler: customer_b_db
│ ├── scheduler: customer_c_db
│ ├── scheduler: internal_analytics_db
│ └── scheduler: staging_db
└── shared worker pool (N workers)
Per-Database Isolation
Each database gets its own scheduler process. Schedulers are fully isolated:
- Each scheduler only sees stream tables in its own database.
- Each scheduler maintains its own DAG, its own tier classification, its own refresh history.
- A crash in one scheduler doesn't affect others. The launcher restarts it.
- Error states are per-database — a suspended stream table in customer_a_db doesn't affect customer_b_db.
This isolation is critical for multi-tenant deployments. Tenant A's misconfigured stream table (running in a loop, consuming resources) can't degrade Tenant B's refreshes.
The Shared Worker Pool
Refresh workers are a shared resource. pg_trickle maintains a pool of max_dynamic_refresh_workers (default: 4) workers that are dispatched across databases.
When a scheduler determines that a stream table needs refreshing, it requests a worker from the shared pool. The worker connects to the scheduler's database, executes the refresh, and returns to the pool.
-- See current worker allocation
SHOW pg_trickle.max_dynamic_refresh_workers;
-- 4
Fair-Share Scheduling
Without any controls, one busy database could monopolize the worker pool. If customer_a_db has 50 due refreshes and customer_b_db has 5, the pool would spend 90% of its time on Customer A.
pg_trickle prevents this with per-database worker quotas:
-- Each database gets at most 2 workers concurrently (server-wide setting)
ALTER SYSTEM SET pg_trickle.per_database_worker_quota = 2;
SELECT pg_reload_conf();
With 4 workers and a quota of 2 per database, at least 2 databases can refresh concurrently. No single database can starve the others.
The quota is a ceiling, not a reservation. If only one database has pending work, it can use all 4 workers. The quota only kicks in when there's contention.
Configuration
Most pg_trickle GUCs are per-database (set in each database independently):
-- For customer_a_db (persisted at the database level so the scheduler picks it up):
ALTER DATABASE customer_a_db SET pg_trickle.scheduler_interval_ms = 500;   -- fast scheduler
ALTER DATABASE customer_a_db SET pg_trickle.max_concurrent_refreshes = 3;  -- up to 3 concurrent refreshes
-- For staging_db:
ALTER DATABASE staging_db SET pg_trickle.scheduler_interval_ms = 5000;     -- slower scheduler
ALTER DATABASE staging_db SET pg_trickle.max_concurrent_refreshes = 1;     -- sequential refreshes
Server-wide settings (in postgresql.conf) apply to all databases:
pg_trickle.max_dynamic_refresh_workers = 8 # Total pool size
pg_trickle.per_database_worker_quota = 3 # Per-DB ceiling
pg_trickle.enabled = on # Global switch
If pg_trickle.enabled = off in postgresql.conf, no database runs any refreshes. Individual databases can't override this.
The Database-Per-Tenant Pattern
For SaaS products using database-per-tenant isolation:
tenant_1_db: 5 stream tables (real-time dashboard)
tenant_2_db: 12 stream tables (analytics pipeline)
tenant_3_db: 3 stream tables (inventory tracking)
...
tenant_50_db: 8 stream tables
Each tenant's stream tables are independent. The launcher discovers all 50 databases and starts 50 schedulers. The shared worker pool (say, 16 workers) services all of them with fair-share quotas.
Scaling:
- Add a new tenant → create database, install pg_trickle, create stream tables (see the provisioning sketch after this list). The launcher discovers it automatically on the next discovery cycle.
- Remove a tenant → drop the database. The launcher detects the missing database and stops its scheduler.
- Tenant needs more throughput → increase their max_concurrent_refreshes within the database. They'll get more workers (up to their quota) when the pool has capacity.
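A minimal provisioning sketch for the add-a-tenant case, assuming the extension is installed via CREATE EXTENSION pg_trickle and you're working from psql:
-- Create the tenant database; the launcher discovers it on its next cycle
CREATE DATABASE tenant_51_db;
\c tenant_51_db
CREATE EXTENSION pg_trickle;
-- then create the tenant's source tables and stream tables as usual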
Monitoring Across Databases
Each database reports its own health:
-- In each database
SELECT * FROM pgtrickle.health_summary();
For a cross-database view, query each database from a monitoring system:
for db in customer_a_db customer_b_db customer_c_db; do
psql -d $db -c "SELECT '$db' AS database, * FROM pgtrickle.health_summary();"
done
Or use Prometheus metrics (exposed per-database) and aggregate in Grafana.
Failure Containment
The isolation model means failures are contained:
| Failure | Impact |
|---|---|
| Scheduler crash in tenant_1_db | Only tenant_1_db stops refreshing. Launcher restarts it. |
| Runaway query in tenant_2_db | Uses one worker. Other databases use remaining workers. |
| Database dropped | Launcher detects and stops the scheduler. No impact on others. |
| pg_trickle disabled in one DB | Only that DB stops. Others continue. |
| Shared worker pool exhausted | All databases queue refreshes. Fair-share ensures no single DB monopolizes. |
The worst case — worker pool exhaustion — affects all databases equally. This is by design: when the system is overloaded, everyone slows down proportionally. No single tenant can cause another tenant's stream tables to stop refreshing entirely.
Sizing the Worker Pool
Rule of thumb: max_dynamic_refresh_workers should be at least equal to the number of databases with "hot" stream tables (tables that change every cycle).
For a server with:
- 10 databases
- 3 databases with real-time dashboards (hot)
- 7 databases with hourly reporting (cold)
Set max_dynamic_refresh_workers = 6 (2× the hot count). The hot databases get immediate workers; cold databases share the remaining capacity.
For CPU-bound refreshes (complex aggregates, many joins), each worker consumes one PostgreSQL backend. Size the worker pool to leave headroom for user connections:
connections_available_to_applications = max_connections - max_dynamic_refresh_workers - launcher - schedulers
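As a worked example with assumed numbers (200 max_connections, 8 refresh workers, 10 databases running pg_trickle, 1 launcher):
SELECT 200 - 8 - 10 - 1 AS connections_available_to_applications;  -- 181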
Summary
pg_trickle's multi-database architecture uses one launcher per server to discover databases, one scheduler per database for isolation, and one shared worker pool for resource efficiency.
Per-database quotas prevent monopolization. Failure is contained per-database. New databases are discovered automatically.
For SaaS products with database-per-tenant isolation, this is the architecture that lets pg_trickle scale horizontally without per-tenant infrastructure overhead. One PostgreSQL server, one extension, fair-share scheduling.
← Back to Blog Index | Documentation
Multi-Tenant Vector Search with Row-Level Security and pg_trickle
Zero cross-tenant data leakage without separate tables or databases
Multi-tenant SaaS and vector search are a challenging combination.
The standard options are: one embedding table per tenant (operational nightmare at scale), one shared table filtered by a tenant_id column (correctness depends entirely on never forgetting the WHERE clause), or a dedicated vector database instance per tenant (very expensive, very complex).
PostgreSQL's Row-Level Security (RLS) offers a fourth option: a shared table where the database enforces per-tenant isolation, independent of application code. Combine this with pg_trickle stream tables and you get per-tenant search corpora that are maintained incrementally, correctly isolated, and queryable efficiently.
This post is about how to build that.
The Problem With Shared Tables
The naive approach to multi-tenant vector search:
CREATE TABLE document_embeddings (
id bigserial PRIMARY KEY,
tenant_id bigint NOT NULL,
content text,
embedding vector(1536),
created_at timestamptz DEFAULT NOW()
);
CREATE INDEX ON document_embeddings USING hnsw (embedding vector_cosine_ops);
Application queries then filter by tenant:
SELECT id, content, embedding <=> $query AS distance
FROM document_embeddings
WHERE tenant_id = $current_tenant_id
ORDER BY embedding <=> $query
LIMIT 10;
This works until it doesn't. The failure modes:
Application bugs: A query that forgets the WHERE tenant_id = ? clause leaks data across tenants. This is an application-level constraint enforced only by developer discipline.
Poor ANN performance with filtering: HNSW searches find the approximate nearest neighbors across the entire index, then filter to the tenant. If a tenant has 1% of the total rows, you might have to visit on the order of 1,000 candidates from the full index before finding 10 that pass the tenant filter: roughly 100× wasted work.
Uneven distributions: A large tenant dominates the ANN index. Their document embeddings cluster tightly. Smaller tenants have their embeddings scattered across the index graph among the dominant tenant's nodes. Query latency and recall vary wildly across tenants.
Row-Level Security: Database-Enforced Isolation
RLS moves the isolation guarantee from application code to the database engine.
-- Enable RLS on the table
ALTER TABLE document_embeddings ENABLE ROW LEVEL SECURITY;
ALTER TABLE document_embeddings FORCE ROW LEVEL SECURITY;
-- Create a policy that restricts reads to the current tenant
CREATE POLICY tenant_isolation ON document_embeddings
AS RESTRICTIVE
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::bigint);
Now every query against document_embeddings automatically scopes to the current tenant, set via SET LOCAL app.current_tenant_id = ? at the start of each request.
-- Application code sets the tenant context
SET LOCAL app.current_tenant_id = 42;
-- This query now implicitly filters to tenant 42 — the RLS policy applies
SELECT id, content, embedding <=> $query AS distance
FROM document_embeddings
ORDER BY distance
LIMIT 10;
The WHERE tenant_id = ? clause is gone from the query. The database enforces it. Forgetting to filter is now impossible — the policy applies to all queries, including ORMs, raw SQL, and debugging queries run manually.
The FORCE ROW LEVEL SECURITY clause makes the policy apply to the table owner too; note that superusers and roles with the BYPASSRLS attribute still bypass RLS entirely.
The ANN Performance Problem Remains
RLS fixes the correctness problem but not the ANN performance problem.
HNSW builds a graph over the entire table. When you query with an RLS policy filtering to one tenant, the database:
- Starts an ANN search from the query vector's neighborhood
- Traverses the HNSW graph
- Applies the RLS filter to each candidate
- Collects enough passing candidates to return LIMIT k results
For small tenants, this is expensive. The graph traversal visits many nodes from other tenants before finding enough that pass the filter.
The pgvector iterative_scan feature helps — it expands the search until enough candidates pass filtering:
SET hnsw.iterative_scan = strict_order;
SELECT id, content, embedding <=> $query AS distance
FROM document_embeddings
ORDER BY distance
LIMIT 10;
But iterative scan has a cost ceiling (hnsw.max_scan_tuples), and for very small tenants in a large shared table, even iterative scan may exhaust the ceiling before finding 10 results.
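Both knobs are tunable if you'd rather spend more work per query on small tenants; the values below are illustrative, and the max_scan_tuples default is 20,000 in current pgvector releases:
SET hnsw.iterative_scan = relaxed_order;   -- keep expanding the search until enough rows pass the filter
SET hnsw.max_scan_tuples = 100000;         -- raise the iterative-scan ceiling from the default of 20000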
The right solution is per-tenant partial indexes:
-- One index per tenant, covering only their rows
CREATE INDEX CONCURRENTLY doc_emb_tenant_42_idx
ON document_embeddings USING hnsw (embedding vector_cosine_ops)
WHERE tenant_id = 42;
With a partial index, the ANN search only navigates the subgraph for tenant 42. It's as if they have their own dedicated index. Recall is high, latency is consistent, and there's no wasted work scanning other tenants' data.
The problem with per-tenant partial indexes is operationally managing them: creating them for new tenants, dropping them for churned tenants, rebuilding them as distributions drift, and monitoring their health.
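A sketch of the creation side of that management, as a backfill loop over existing tenants (plain CREATE INDEX, since CONCURRENTLY can't run inside a DO block; the enumeration and naming scheme are assumptions):
DO $$
DECLARE
  t bigint;
BEGIN
  FOR t IN SELECT DISTINCT tenant_id FROM document_embeddings LOOP
    EXECUTE format(
      'CREATE INDEX IF NOT EXISTS doc_emb_tenant_%s_idx
         ON document_embeddings USING hnsw (embedding vector_cosine_ops)
         WHERE tenant_id = %s', t, t);
  END LOOP;
END $$;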
pg_trickle Stream Tables for Per-Tenant Corpora
pg_trickle stream tables change the operational model. Instead of maintaining indexes on the shared raw table, you maintain per-tenant (or group-of-tenant) stream tables that contain only that tenant's data:
-- Per-tenant search corpus
SELECT pgtrickle.create_stream_table(
name => 'tenant_42_corpus',
query => $$
SELECT
d.id,
d.title,
d.content,
d.embedding,
u.display_name AS author,
array_agg(t.name ORDER BY t.name) AS tags,
d.created_at
FROM documents d
JOIN users u ON u.id = d.author_id
LEFT JOIN document_tags dt ON dt.document_id = d.id
LEFT JOIN tags t ON t.id = dt.tag_id
WHERE d.tenant_id = 42
AND d.published = true
GROUP BY d.id, d.title, d.content, d.embedding,
u.display_name, d.created_at
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
CREATE INDEX ON tenant_42_corpus USING hnsw (embedding vector_cosine_ops);
Now tenant_42_corpus is a real table containing only tenant 42's published documents, denormalized with author names and tags, with its own HNSW index. Queries against this table are fast because the index covers exactly the right data.
The DVM engine maintains the corpus: new documents appear within 5 seconds, tag changes propagate, unpublished documents are removed.
The Operational Challenge: Many Tenants
For 10 tenants, creating individual stream tables is fine. For 1,000 tenants, it's not.
The practical solution at scale is tiered tenancy:
Large tenants (top 5–10% by document count): individual stream tables with dedicated HNSW indexes.
Medium tenants: grouped stream tables covering 10–50 tenants per group, with partial indexes per tenant within the group.
Small tenants (long tail, few documents each): shared table with RLS and partial indexes; or, for very small tenants, plain exact search, which is fast enough at that scale that an ANN index isn't worth the overhead.
-- Tier 1: Large tenant, individual corpus
SELECT pgtrickle.create_stream_table(
name => 'corpus_tenant_42',
query => $$ ... WHERE tenant_id = 42 ... $$,
...
);
-- Tier 2: Medium tenants grouped by a hash
SELECT pgtrickle.create_stream_table(
name => 'corpus_group_07',
query => $$ ... WHERE tenant_id % 20 = 7 ... $$,
...
);
-- Plus per-tenant partial HNSW within the group
-- Tier 3: Long tail, shared table with RLS (no stream table)
-- Exact search for tiny tenants is fast enough
The tier assignment can change as tenants grow. When a medium tenant exceeds a threshold, create an individual stream table and remove them from the group.
Combining RLS With Stream Tables
Stream tables are regular PostgreSQL tables. You can apply RLS policies to them:
-- The shared "all tenants" stream table approach with RLS
SELECT pgtrickle.create_stream_table(
name => 'document_corpus',
query => $$
SELECT
d.id, d.tenant_id, d.title, d.content, d.embedding,
u.display_name AS author
FROM documents d
JOIN users u ON u.id = d.author_id
WHERE d.published = true
$$,
...
);
ALTER TABLE document_corpus ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON document_corpus
AS RESTRICTIVE FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::bigint);
-- Per-tenant partial HNSW indexes on the stream table
CREATE INDEX corpus_tenant_42_idx ON document_corpus
USING hnsw (embedding vector_cosine_ops) WHERE tenant_id = 42;
This combines incremental maintenance (the stream table stays fresh) with database-enforced isolation (RLS) and efficient search (partial indexes per tenant).
The stream table contains all tenants' data — maintained by a single refresh cycle — but each query is automatically scoped to the current tenant by RLS, and executes using that tenant's partial index.
Helping the Planner Pick the Partial Index
For the query planner to use the correct partial index, the query needs to make the tenant constraint visible to the planner.
With RLS, the filter comes from the policy's comparison against current_setting('app.current_tenant_id'), and that value isn't a constant the planner knows at plan time. The planner therefore can't prove the policy qual implies a partial index's WHERE tenant_id = 42 predicate, so it may choose the full-table HNSW index over the partial index.
The workaround is to set the tenant context before planning, using SET LOCAL in the same transaction, and to include the tenant condition explicitly in the query:
BEGIN;
SET LOCAL app.current_tenant_id = 42;
-- Include tenant_id explicitly so the planner can choose the partial index
SELECT id, content, embedding <=> $query AS distance
FROM document_corpus
WHERE tenant_id = 42 -- explicit, even though RLS would filter anyway
ORDER BY distance
LIMIT 10;
COMMIT;
The explicit WHERE tenant_id = 42 lets the planner see that the partial index corpus_tenant_42_idx (which covers WHERE tenant_id = 42) is applicable, and choose it over the full-table index. The RLS policy remains as a safety net.
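You can confirm the planner's choice with EXPLAIN (the vector literal below is a truncated placeholder; use a real query embedding of the right dimension):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM document_corpus
WHERE tenant_id = 42
ORDER BY embedding <=> '[0.01, 0.02, 0.03]'::vector
LIMIT 10;
-- Look for "Index Scan using corpus_tenant_42_idx" rather than the full-table HNSW index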
Drift-Aware Reindexing Per Tenant
With per-tenant partial indexes, drift affects each index independently. A tenant that adds 30% new documents needs their partial index rebuilt. A tenant with stable content doesn't.
-- Set drift policy on the corpus stream table
SELECT pgtrickle.alter_stream_table(
'document_corpus',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.20,
reindex_scope => 'per_partition' -- v0.38+
);
The per_partition scope applies the drift threshold per-tenant (or per-partition for partitioned tables), so only the tenants with high churn get their indexes rebuilt.
What Isolation Level You Actually Get
With this setup:
- Data isolation: RLS enforces it. Even if application code is buggy, cross-tenant reads are impossible.
- Index isolation: Partial indexes ensure ANN search stays within tenant boundaries.
- Freshness isolation: Stream table refresh applies to all tenants uniformly. A tenant with high write volume doesn't delay freshness for others.
- Schema isolation: None. All tenants share the same schema. For full schema isolation (different columns per tenant), you'd need separate tables or a tenant-specific metadata system.
The tradeoff: shared infrastructure is efficient and manageable. True schema-level isolation requires separate tables or separate database instances, which is operationally much heavier.
For most multi-tenant SaaS use cases — same schema, different data — the RLS + stream table + partial index approach provides correct isolation while staying operationally sane.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
From Nexmark to Production: Benchmarking Stream Processing in PostgreSQL
How pg_trickle performs on the standard streaming benchmark suite
Nexmark is to stream processing what TPC-H is to analytical databases: the standard benchmark everyone uses to compare systems. Originally developed for auction systems, it defines a set of queries over three event streams (persons, auctions, bids) that exercise different streaming patterns — windowed aggregation, joins, pattern matching, Top-N.
Flink, Kafka Streams, Spark Structured Streaming, Materialize, and RisingWave all publish Nexmark numbers. pg_trickle does too. Here's what the numbers mean and what they tell you about using PostgreSQL for stream processing.
The Nexmark Setup
Nexmark simulates an online auction system:
- Persons: New user registrations (low volume).
- Auctions: New auction listings (medium volume).
- Bids: Bids on auctions (high volume — this is the firehose).
The benchmark defines 8 queries, each testing a different streaming pattern:
| Query | Description | Pattern |
|---|---|---|
| Q0 | Pass-through | Baseline (no computation) |
| Q1 | Currency conversion | Stateless map |
| Q2 | Filter by auction ID | Stateless filter |
| Q3 | Join persons + auctions by state | Windowed join |
| Q4 | Average closing price per category | Windowed aggregation |
| Q5 | Top-5 auctions by bid count in last 10 min | Sliding window Top-N |
| Q7 | Highest bid in last 10 min | Sliding window MAX |
| Q8 | New persons who opened auctions in last 10 min | Windowed join |
Source Tables
CREATE TABLE persons (
id bigint PRIMARY KEY,
name text NOT NULL,
email text NOT NULL,
city text NOT NULL,
state text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE auctions (
id bigint PRIMARY KEY,
seller_id bigint NOT NULL REFERENCES persons(id),
category text NOT NULL,
initial_bid numeric(12,2) NOT NULL,
expires_at timestamptz NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE bids (
id bigserial PRIMARY KEY,
auction_id bigint NOT NULL REFERENCES auctions(id),
bidder_id bigint NOT NULL REFERENCES persons(id),
amount numeric(12,2) NOT NULL,
bid_at timestamptz NOT NULL DEFAULT now()
);
Stream Tables for Each Query
-- Q1: Currency conversion (stateless map)
SELECT pgtrickle.create_stream_table('nexmark_q1',
$$SELECT id, auction_id, bidder_id,
amount * 0.908 AS amount_eur,
bid_at
FROM bids$$,
schedule => '1s', refresh_mode => 'DIFFERENTIAL');
-- Q3: Join persons + auctions by state
SELECT pgtrickle.create_stream_table('nexmark_q3',
$$SELECT p.name, p.city, p.state, a.id AS auction_id
FROM persons p
JOIN auctions a ON a.seller_id = p.id
WHERE p.state IN ('OR', 'ID', 'CA')$$,
schedule => '1s', refresh_mode => 'DIFFERENTIAL');
-- Q4: Average closing price per category
SELECT pgtrickle.create_stream_table('nexmark_q4',
$$SELECT a.category,
AVG(b.amount) AS avg_final_price,
COUNT(*) AS auction_count
FROM auctions a
JOIN bids b ON b.auction_id = a.id
GROUP BY a.category$$,
schedule => '1s', refresh_mode => 'DIFFERENTIAL');
-- Q5: Top-5 auctions by bid count (sliding window)
SELECT pgtrickle.create_stream_table('nexmark_q5',
$$SELECT auction_id, COUNT(*) AS bid_count
FROM bids
WHERE bid_at >= now() - interval '10 minutes'
GROUP BY auction_id
ORDER BY bid_count DESC
LIMIT 5$$,
schedule => '1s', refresh_mode => 'DIFFERENTIAL',
temporal_mode => 'sliding_window');
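Q7, the sliding-window MAX, follows the same shape; a sketch of how it might be expressed (the benchmark's exact definition may differ):
-- Q7: Highest bid in the last 10 minutes (sliding window)
SELECT pgtrickle.create_stream_table('nexmark_q7',
  $$SELECT MAX(amount) AS highest_bid
    FROM bids
    WHERE bid_at >= now() - interval '10 minutes'$$,
  schedule => '1s', refresh_mode => 'DIFFERENTIAL',
  temporal_mode => 'sliding_window');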
The Numbers
Tested on a single-node PostgreSQL 18 instance (8 vCPU, 32GB RAM, NVMe SSD). Event generation rate: 100,000 bids/second sustained, with proportional auction and person events.
| Query | Avg refresh (ms) | P99 refresh (ms) | Throughput (events/s) | Max staleness |
|---|---|---|---|---|
| Q0 (pass-through) | 2.1 | 4.8 | 120K | 1.0s |
| Q1 (map) | 2.3 | 5.1 | 110K | 1.0s |
| Q2 (filter) | 1.8 | 3.9 | 130K | 1.0s |
| Q3 (join) | 8.4 | 18.2 | 95K | 1.0s |
| Q4 (agg + join) | 12.1 | 28.5 | 80K | 1.1s |
| Q5 (window Top-N) | 15.3 | 34.7 | 65K | 1.2s |
| Q7 (window MAX) | 6.8 | 14.1 | 100K | 1.0s |
| Q8 (window join) | 11.2 | 25.3 | 85K | 1.1s |
Throughput is the maximum sustained event ingestion rate before the scheduler falls behind (staleness exceeds the schedule interval). At 100K bids/second, all queries keep up with under 1.5 seconds of staleness.
How to Read These Numbers
vs. Flink
Flink on a 4-node cluster handles millions of events per second for Nexmark. pg_trickle on a single node handles ~100K. That's an order of magnitude or more — but pg_trickle is running on a quarter of the hardware, inside a general-purpose database rather than a dedicated stream processor.
For most PostgreSQL workloads, 100K events/second is more than enough. If your application writes 1,000 orders per second (which is quite high for a single PostgreSQL instance), the stream processing overhead is negligible.
vs. Materialize
Materialize (now Redpanda-owned) is a dedicated IVM system. Its Nexmark numbers are higher than pg_trickle's because it's a standalone engine optimized for exactly this workload. But it's a separate database — your application can't use BEGIN ... INSERT ... SELECT FROM stream_table ... COMMIT in the same transaction.
vs. "Just Use a Cron Job"
The comparison that matters for most teams isn't pg_trickle vs. Flink. It's pg_trickle vs. the cron job that refreshes a materialized view every 5 minutes. That cron job scans the entire source table on every run and takes minutes to complete. pg_trickle processes only the changes and takes milliseconds.
What Nexmark Doesn't Tell You
Nexmark tests throughput under sustained load with a uniform event distribution. Production workloads are spikier and more complex:
- Spike handling. A flash sale produces a burst of 10× normal traffic for 30 seconds. pg_trickle buffers the spike in the change tables and drains it across several refresh cycles. The staleness increases temporarily, then recovers.
- Complex queries. Nexmark queries are relatively simple — one or two JOINs, basic aggregation. Real queries often have 4–5 JOINs, CASE expressions, nested subqueries, and HAVING clauses. More complex queries have higher per-refresh-cycle costs.
- Concurrent reads. Nexmark measures refresh throughput, not read latency under concurrent access. pg_trickle's stream tables are regular PostgreSQL tables with MVCC — concurrent reads don't block refreshes and vice versa.
Running the Benchmark Yourself
The Nexmark benchmark is included in pg_trickle's test suite:
# Build the E2E Docker image (includes pg_trickle)
just build-e2e-image
# Run Nexmark queries
cargo test --test e2e_tpch_tests -- --ignored nexmark --test-threads=1 --nocapture
# Control the event generation rate and duration
NEXMARK_EVENTS_PER_SEC=50000 NEXMARK_DURATION_SEC=60 \
cargo test --test e2e_tpch_tests -- --ignored nexmark --test-threads=1 --nocapture
The benchmark reports per-query throughput, latency percentiles, and the maximum sustainable event rate.
The Bottom Line
pg_trickle isn't trying to replace Flink or Kafka Streams for large-scale stream processing. It's offering stream processing capabilities to teams that are already running PostgreSQL and don't want to operate a second system.
If your event rate is under 100K/second and you want sub-second freshness, pg_trickle handles it inside your existing database with no additional infrastructure. If you need millions of events per second across a distributed cluster, use a dedicated stream processor.
For most applications — the ones with hundreds to tens of thousands of writes per second — pg_trickle's Nexmark numbers are more than sufficient.
← Back to Blog Index | Documentation
How to Change a Stream Table Query Without Taking It Offline
Online schema evolution for incremental views
Your stream table has been running in production for three months. The business wants a new column. Or a different aggregation. Or the source table schema changed and your JOIN needs updating.
With a materialized view, you'd DROP the old one and CREATE a new one. During the window between drop and create, any query against the view fails.
With a naive stream table approach, you'd drop the stream table, losing the CDC triggers, change buffers, and refresh history, then recreate it from scratch. During the initial full refresh — which can take minutes for large tables — the stream table either doesn't exist or contains stale data.
pg_trickle's alter_stream_table(..., query => ...) does this online. The stream table stays queryable throughout the migration.
The Simple Case: Adding a Column
-- Original stream table
SELECT pgtrickle.create_stream_table(
'order_summary',
$$SELECT customer_id, SUM(amount) AS total, COUNT(*) AS cnt
FROM orders GROUP BY customer_id$$,
schedule => '3s', refresh_mode => 'DIFFERENTIAL'
);
-- Later: add average order value
SELECT pgtrickle.alter_stream_table(
'order_summary',
query => $$SELECT customer_id,
SUM(amount) AS total,
COUNT(*) AS cnt,
AVG(amount) AS avg_amount
FROM orders GROUP BY customer_id$$
);
What happens internally:
1. pg_trickle parses the new query and builds a new operator tree.
2. It compares the new schema to the current storage table schema.
3. It adds the avg_amount column to the storage table (an ALTER TABLE ADD COLUMN — non-blocking in PostgreSQL).
4. It triggers a full refresh to populate the new column.
5. It updates the CDC triggers if the source table set changed.
6. It updates the catalog entry with the new query and operator tree.
During steps 1–6, the old data in the stream table is still queryable. The avg_amount column is NULL until the full refresh completes. After the refresh, all rows have the correct value. The switch is atomic — the refresh runs in a single transaction.
Changing the Aggregation
More complex changes — altering the GROUP BY, changing JOINs, or restructuring the query — trigger a full schema migration:
-- Change grouping from per-customer to per-region
SELECT pgtrickle.alter_stream_table(
'order_summary',
query => $$SELECT c.region,
SUM(o.amount) AS total,
COUNT(*) AS cnt
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region$$
);
This is a bigger change: the GROUP BY columns are different, a new source table (customers) is introduced, and the row identity changes. pg_trickle handles this by:
1. Creating a new storage table with the new schema.
2. Running a full refresh to populate it.
3. Atomically swapping the old storage table for the new one (renaming under an exclusive lock held for microseconds).
4. Dropping the old storage table.
5. Updating CDC triggers to include customers.
During the full refresh (step 2), the old stream table is still serving reads. The swap in step 3 is nearly instantaneous. From the application's perspective, the stream table name never changes — it just starts returning rows with the new schema.
Changing the Schedule or Refresh Mode
These are lighter operations that don't require a schema migration:
-- Speed up the refresh cycle
SELECT pgtrickle.alter_stream_table('order_summary', schedule => '1 second');
-- Switch from DIFFERENTIAL to IMMEDIATE
SELECT pgtrickle.alter_stream_table('order_summary', refresh_mode => 'IMMEDIATE');
Schedule changes take effect on the next scheduler cycle. Refresh mode changes may trigger a one-time full refresh to ensure the stream table is consistent with the new mode's requirements.
What About Downstream Dependencies?
If order_summary has downstream stream tables (other stream tables that reference it), the schema migration cascades correctly:
-- gold_dashboard depends on order_summary
SELECT pgtrickle.create_stream_table(
'gold_dashboard',
$$SELECT region, SUM(total) AS grand_total
FROM order_summary GROUP BY region$$,
schedule => '5s', refresh_mode => 'DIFFERENTIAL'
);
When you alter order_summary's query, pg_trickle checks whether the schema change is compatible with downstream dependents. If the columns referenced by gold_dashboard still exist with compatible types, the downstream stream table continues working. If not, pg_trickle raises an error and tells you which downstream tables need updating:
ERROR: altering query for "order_summary" would break dependent stream table
"gold_dashboard": column "total" would be removed.
HINT: Drop or alter "gold_dashboard" first, or include column "total" in the new query.
No silent breakage. No orphaned stream tables pointing at columns that don't exist.
Suspending and Resuming
If you need to make multiple changes atomically — alter the query, update the schedule, and adjust the refresh mode — you can suspend the stream table first:
-- Pause refresh cycles
SELECT pgtrickle.alter_stream_table('order_summary', status => 'SUSPENDED');
-- Make changes
SELECT pgtrickle.alter_stream_table('order_summary',
query => $$ ... new query ... $$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL'
);
-- Resume
SELECT pgtrickle.resume_stream_table('order_summary');
While suspended, the change buffers continue accumulating (CDC triggers still fire). When you resume, the next refresh cycle drains all buffered changes. No data is lost during the suspension.
The Export/Import Workflow
For changes that require careful validation, you can export the stream table's definition, edit it, and reimport:
-- Export current definition as a SQL statement
SELECT pgtrickle.export_definition('order_summary');
This returns the full create_stream_table call with all current parameters. You can modify it in your migration script, test it in a staging environment, and apply it in production.
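From psql, you can capture that output directly into a migration file (the path is just an example):
\t on
\o migrations/order_summary_v2.sql
SELECT pgtrickle.export_definition('order_summary');
\o
\t off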
When You Do Need Downtime
There's one scenario where online migration isn't currently supported: changing the stream table's name. Renaming a stream table requires dropping it and recreating it under the new name.
For everything else — query changes, schedule changes, refresh mode changes, adding or removing columns, changing JOINs, changing GROUP BY — alter_stream_table handles it online.
← Back to Blog Index | Documentation
The Outbox Pattern, Turbocharged
Transactionally Consistent Event Emission from PostgreSQL Without Dual-Write
The outbox pattern exists because distributed systems lie.
The lie: "I'll save the record to the database and send the event to Kafka in the same operation." The reality: one of those can fail while the other succeeds. You commit the database record but the Kafka send times out. Or the Kafka send succeeds but your process crashes before the database commits. Either way, downstream systems have a different view of the world than your database.
The outbox pattern is the solution: write the event to an outbox table in the same transaction as the business record. A separate process reads the outbox and publishes to Kafka/SQS/whatever. The outbox is durable because it's in PostgreSQL. Publication can be retried safely, and with idempotent consumers, effectively exactly-once processing becomes achievable.
This pattern is well-understood. It's implemented by Debezium, AWS EventBridge Pipes, and a dozen ORM libraries. It works.
What it doesn't do: maintain the outbox entries incrementally as derived data changes. If the event you need to emit is "the customer's order total changed" — a derived aggregate — you need to compute that aggregate, write it to the outbox, and keep it fresh as orders change.
This is where pg_trickle stream tables change the model.
The Standard Outbox and Its Limitations
The standard outbox implementation:
CREATE TABLE outbox (
id bigserial PRIMARY KEY,
aggregate_type text NOT NULL, -- e.g., 'Order', 'Customer'
aggregate_id bigint NOT NULL,
event_type text NOT NULL, -- e.g., 'OrderPlaced', 'TotalUpdated'
payload jsonb NOT NULL,
created_at timestamptz DEFAULT NOW(),
published_at timestamptz,
published boolean DEFAULT false
);
-- Application code
BEGIN;
INSERT INTO orders (customer_id, total, ...) VALUES (...);
INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('Order', $new_order_id, 'OrderPlaced', $payload::jsonb);
COMMIT;
The outbox relay picks up unpublished rows, sends them, marks them published. Clean and correct.
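A minimal sketch of the relay's polling step (batch size and ordering are assumptions; a production relay publishes the returned rows to the broker and handles publish failures, which is why duplicate delivery remains possible):
-- Claim a batch of unpublished events, mark them, and hand them to the publisher
WITH claimed AS (
  SELECT id
  FROM outbox
  WHERE published = false
  ORDER BY id
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
UPDATE outbox o
SET published = true, published_at = now()
FROM claimed
WHERE o.id = claimed.id
RETURNING o.id, o.aggregate_type, o.event_type, o.payload;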
Limitation 1: Complex derived events require application logic.
Emitting an OrderTotalChanged event when an order's total changes due to an item update requires detecting the change, computing the new total, constructing the event, and writing it to the outbox — all in the application layer. This logic is duplicated for every code path that can change an order's total.
Limitation 2: Aggregate-level events require aggregation.
Emitting a CustomerRevenueUpdated event — the customer's total revenue across all orders — requires aggregating in the application, which means either an extra SELECT or maintaining the aggregate as denormalized state. If the aggregate is maintained via triggers, you're back to the trigger-maintenance problem.
Limitation 3: Multi-table derived events are fragile.
If the event payload should include denormalized fields from related tables (shipping address, product name, account tier), the application must join them at event creation time. This works but creates tight coupling between the event schema and the application code at write time.
Stream Tables as Event Sources
pg_trickle stream tables are maintained tables. When a stream table row changes, the change is a computable event. You can attach an outbox relay directly to stream table changes.
-- A stream table that computes customer-level revenue state
SELECT pgtrickle.create_stream_table(
name => 'customer_revenue_state',
query => $$
SELECT
c.id AS customer_id,
c.email,
c.account_tier,
COUNT(o.id) AS order_count,
SUM(o.total) AS total_revenue,
MAX(o.created_at) AS last_order_at
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.email, c.account_tier
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
Every time a customer's order count or total revenue changes — whether from a new order, a refund, or a cancelled order — the corresponding row in customer_revenue_state is updated with the new aggregate values.
Now you can attach an outbox relay that watches for changes in customer_revenue_state and emits events:
-- Attach an outbox relay to the stream table
SELECT pgtrickle.create_outbox_relay(
stream_table => 'customer_revenue_state',
outbox_table => 'outbox',
aggregate_type => 'Customer',
aggregate_id_column => 'customer_id',
event_type_fn => $$
CASE
WHEN OLD.order_count IS NULL THEN 'CustomerFirstOrderPlaced'
WHEN NEW.order_count > OLD.order_count THEN 'CustomerOrderAdded'
WHEN NEW.total_revenue < OLD.total_revenue THEN 'CustomerRefundProcessed'
ELSE 'CustomerRevenueUpdated'
END
$$,
payload_fn => $$
jsonb_build_object(
'customer_id', NEW.customer_id,
'email', NEW.email,
'account_tier', NEW.account_tier,
'order_count', NEW.order_count,
'total_revenue', NEW.total_revenue,
'last_order_at', NEW.last_order_at,
'revenue_delta', NEW.total_revenue - COALESCE(OLD.total_revenue, 0)
)
$$
);
When customer_revenue_state is updated by a refresh cycle, pg_trickle writes the corresponding event rows to outbox as part of the same transaction. Your existing outbox relay picks them up and publishes.
The derived aggregate — total revenue, order count — is computed once, by the DVM engine, not duplicated across every application code path that can change orders.
The Mechanics
The outbox relay works at the stream table layer:
1. A refresh cycle computes the delta to customer_revenue_state — a set of rows to insert, update, or delete.
2. For each changed row, the relay evaluates event_type_fn and payload_fn against (OLD.*, NEW.*).
3. The relay writes one outbox row per changed stream table row.
4. Steps 2–3 happen inside the same transaction that applies the delta.
This is transactionally safe: if the refresh cycle transaction rolls back (e.g., due to a conflict), the outbox entries are also rolled back. The outbox never contains events for changes that didn't commit.
The OLD.* and NEW.* semantics let you express transitions:
- OLD IS NULL → this is a new row (first-time state)
- NEW IS NULL → this row was deleted
- NEW.total_revenue > OLD.total_revenue → revenue increased
- NEW.order_count > OLD.order_count → new order placed
This is richer than a raw table trigger, which fires on every write to orders — including changes that don't affect the customer's aggregate state. The stream table relay fires only when the aggregate state actually changes.
Enriched Event Payloads
A frequent frustration with the standard outbox: the payload at write time might not contain enough context for downstream consumers. "OrderPlaced" fires, but the consumer needs to know the order total, the product names, the customer's account tier. The application at write time may not have all that information ready.
With stream tables as event sources, the event payload is computed from the fully-denormalized stream table. You can include anything the stream table contains:
SELECT pgtrickle.create_stream_table(
name => 'order_event_state',
query => $$
SELECT
o.id AS order_id,
o.status,
o.total,
o.created_at,
c.id AS customer_id,
c.email AS customer_email,
c.account_tier,
s.name AS shipping_address_name,
s.city,
s.country,
array_agg(jsonb_build_object(
'product_id', p.id,
'product_name', p.name,
'quantity', oi.quantity,
'unit_price', oi.unit_price
) ORDER BY oi.id) AS line_items
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN shipping_addresses s ON s.id = o.shipping_address_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
GROUP BY o.id, o.status, o.total, o.created_at,
c.id, c.email, c.account_tier,
s.name, s.city, s.country
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
The event payload includes the full order with line items, customer context, and shipping address — all pre-joined. The downstream consumer receives a self-contained event and doesn't need to query back to the database.
This is the "fat event" pattern: events that carry all the context consumers need, rather than just an ID that consumers have to resolve.
The Deduplication Story
Outbox patterns require consumers to handle duplicate delivery (at-least-once delivery is the standard guarantee). When stream tables drive the outbox, there's an additional source of potential duplicates to think about.
If a customer places three orders in the same 10-second refresh cycle, the stream table will update once — to the aggregate state after all three orders. The outbox will have one event, reflecting the final state.
If the outbox relay fails to publish and the refresh worker retries (applying the same delta again), the outbox will have a duplicate row for the same state. The consumer's deduplication logic (typically, "have I seen an event with this aggregate_id and this state before?") handles this correctly.
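A minimal consumer-side idempotency guard can be as simple as recording processed event ids (a sketch; the table and the example id are assumptions):
CREATE TABLE IF NOT EXISTS processed_events (
  event_id     bigint PRIMARY KEY,
  processed_at timestamptz NOT NULL DEFAULT now()
);
-- Returns a row only the first time an event id is seen; duplicates insert nothing
INSERT INTO processed_events (event_id)
VALUES (12345)
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;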
The key insight: stream tables aggregate changes within a refresh cycle, so the outbox volume is naturally lower than it would be with row-level triggers on the source tables. Three new orders in 10 seconds produce one CustomerRevenueUpdated event, not three.
Monitoring Outbox Health
-- Outbox relay status
SELECT
relay_name,
stream_table,
events_emitted_last_cycle,
avg_emit_ms,
outbox_pending_count,
oldest_pending_age_secs
FROM pgtrickle.outbox_relay_status();
-- Are events being published fast enough?
SELECT
COUNT(*) FILTER (WHERE published = false) AS pending,
COUNT(*) FILTER (WHERE published = false
AND created_at < NOW() - INTERVAL '5 minutes') AS old_pending,
MIN(created_at) FILTER (WHERE published = false) AS oldest_pending_at
FROM outbox;
The outbox_pending_count growing over time indicates the outbox relay can't keep up with the event rate. Common causes: slow downstream (Kafka backpressure, API rate limits), slow database (outbox table index needs VACUUM or REINDEX), or the relay is down.
When to Use This Pattern
The stream-table-backed outbox is the right choice when:
- The event you need to emit is a derived or aggregated value, not a raw row change
- The event payload benefits from denormalization (fat events)
- You want to decouple the event schema from the application write path
- Multiple code paths affect the same aggregate, and you want one canonical event source
It's overkill when:
- The event maps directly to a raw table change (e.g., UserCreated fires when a row is inserted into users)
- You need sub-second event latency (stream tables add up to schedule latency)
- You're doing event sourcing where every intermediate state matters (stream tables coalesce changes within a cycle)
For raw table events, a standard trigger-based outbox is simpler. For derived aggregate events, the stream-table-backed outbox is significantly more correct and easier to maintain.
The Bigger Picture
The transactional outbox pattern solves the dual-write problem. Stream tables solve the derived-data freshness problem. Combined, they solve a third problem: emitting events that reflect aggregate state changes in a way that's both transactionally safe and computationally efficient.
Without this combination, teams typically end up with either:
- Events that fire on raw table changes and require consumers to compute aggregates (chatty, couples consumers to the database schema)
- Scheduled aggregate recomputation with a separate event emission step (latency, correctness concerns at the boundary)
The stream-table outbox gives you: aggregate events, computed correctly by the DVM engine, emitted transactionally, with the freshness controlled by your schedule parameter.
The reliability of the outbox pattern, the correctness of IVM, and the expressiveness of SQL. That combination is harder to build than it sounds, but pg_trickle makes it a configuration decision rather than an engineering project.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Parameterized Stream Tables: Building a SQL View Library
Patterns for reusable, tenant-scoped, and versionable stream table definitions in multi-tenant schemas
As your pg_trickle deployment grows, you'll notice a pattern: many stream tables have the same structure, differing only in a filter value. "Revenue by product for tenant A" and "revenue by product for tenant B" are the same query with different WHERE clauses. "Daily orders for the US region" and "daily orders for the EU region" are the same aggregation scoped to different subsets.
The naive approach — creating a separate stream table per tenant or per segment — works but doesn't scale. With 500 tenants, you have 500 nearly identical stream tables, 500 sets of CDC triggers, and a DAG with 500 nodes that are logically the same computation. Changes to the aggregation logic require modifying 500 stream table definitions.
This post explores patterns for building reusable, parameterized stream table definitions that serve multiple tenants or segments efficiently, while maintaining the per-tenant isolation that applications expect.
Pattern 1: Single Stream Table with Tenant Column
The simplest and most efficient pattern is a single stream table that includes the tenant or segment as a grouping column:
SELECT pgtrickle.create_stream_table(
'revenue_by_tenant_product',
$$
SELECT
tenant_id,
product_category,
date_trunc('day', ordered_at) AS day,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY tenant_id, product_category, date_trunc('day', ordered_at)
$$
);
All tenants share a single stream table. When tenant A places an order, only their group is updated. Tenant B's rows are untouched. The incremental engine is already group-aware — it processes deltas per group, so adding tenant_id as a grouping column doesn't change the fundamental cost model.
Applications filter at query time:
-- Application query (per-tenant dashboard)
SELECT product_category, day, revenue, order_count
FROM revenue_by_tenant_product
WHERE tenant_id = 'tenant_abc'
ORDER BY day DESC;
With an index on (tenant_id, day), this is a fast point lookup. No sequential scan, no cross-tenant data leakage.
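The stream table is a regular table, so the supporting index is ordinary DDL:
-- Backs the per-tenant, per-day dashboard lookup
CREATE INDEX ON revenue_by_tenant_product (tenant_id, day DESC);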
Advantages:
- Single stream table to manage (one definition, one DAG node)
- Shared CDC infrastructure (one trigger per source table)
- Efficient index-based tenant isolation at query time
- Adding a new tenant requires no schema changes
Disadvantages:
- All tenants must use the same aggregation logic
- Row-level security (RLS) adds a performance tax on reads
- Very large tenants and very small tenants share the same refresh cadence
Pattern 2: Template Functions for Tenant-Specific Tables
When tenants need different schemas or different refresh cadences, you can use a template function that generates tenant-specific stream tables:
-- Template function: creates a per-tenant stream table
CREATE OR REPLACE FUNCTION create_tenant_analytics(p_tenant_id text)
RETURNS void AS $$
BEGIN
PERFORM pgtrickle.create_stream_table(
'analytics_' || p_tenant_id,
format(
'SELECT
product_category,
date_trunc(''day'', ordered_at) AS day,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
WHERE tenant_id = %L
GROUP BY product_category, date_trunc(''day'', ordered_at)',
p_tenant_id
)
);
END;
$$ LANGUAGE plpgsql;
-- Create analytics for a new tenant
SELECT create_tenant_analytics('tenant_abc');
SELECT create_tenant_analytics('tenant_xyz');
Each tenant gets their own stream table with a WHERE filter. pg_trickle's incremental engine evaluates each stream table's filter against incoming changes, so an order for tenant_abc only triggers a refresh on analytics_tenant_abc — not on analytics_tenant_xyz.
Advantages:
- Per-tenant refresh cadence and configuration
- Per-tenant table permissions (no RLS needed)
- Tenants can have different schemas if needed
Disadvantages:
- More DAG nodes (one per tenant)
- More CDC overhead (each source table change is evaluated against all tenant filters)
- Schema management complexity (updates require iterating over all tenants)
Pattern 3: Schema-Per-Tenant Isolation
For strict multi-tenant isolation (common in B2B SaaS with compliance requirements), each tenant has their own schema:
-- Tenant provisioning: create schema and stream tables
CREATE OR REPLACE FUNCTION provision_tenant(p_tenant text)
RETURNS void AS $$
BEGIN
EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', p_tenant);
EXECUTE format(
'CREATE TABLE %I.orders (
id serial PRIMARY KEY,
product_category text,
amount numeric,
ordered_at timestamptz DEFAULT now()
)', p_tenant
);
-- Stream table in tenant schema
PERFORM pgtrickle.create_stream_table(
p_tenant || '.daily_revenue',
format(
'SELECT
product_category,
date_trunc(''day'', ordered_at) AS day,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM %I.orders
GROUP BY product_category, date_trunc(''day'', ordered_at)',
p_tenant
)
);
END;
$$ LANGUAGE plpgsql;
Each tenant is completely isolated — different schemas, different source tables, different stream tables. pg_trickle manages each independently. This is the most isolated but also the most resource-intensive pattern.
Pattern 4: Versioned Definitions
As your analytics evolve, you need to update stream table definitions without breaking existing consumers. A versioning pattern:
-- Version 1: basic revenue aggregation
SELECT pgtrickle.create_stream_table(
'revenue_v1',
$$
SELECT
tenant_id,
date_trunc('day', ordered_at) AS day,
SUM(amount) AS revenue
FROM orders
GROUP BY tenant_id, date_trunc('day', ordered_at)
$$
);
-- Version 2: adds product category and order count
SELECT pgtrickle.create_stream_table(
'revenue_v2',
$$
SELECT
tenant_id,
product_category,
date_trunc('day', ordered_at) AS day,
SUM(amount) AS revenue,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY tenant_id, product_category, date_trunc('day', ordered_at)
$$
);
Both versions coexist. Applications on the old API continue reading from revenue_v1. New features use revenue_v2. Both are maintained incrementally from the same source table. Once all consumers have migrated, drop v1.
For more sophisticated versioning, use views as the public interface:
-- Public interface: a view that points to the current version
CREATE VIEW revenue_current AS SELECT * FROM revenue_v2;
-- When v3 is ready:
-- 1. Create revenue_v3 stream table
-- 2. CREATE OR REPLACE VIEW revenue_current AS SELECT * FROM revenue_v3;
-- 3. Drop revenue_v2 after migration period
Pattern 5: Composable Building Blocks
Build a library of reusable intermediate stream tables that serve as building blocks for application-specific views:
-- Building block 1: Order facts (shared by all analytics)
SELECT pgtrickle.create_stream_table(
'order_facts',
$$
SELECT
o.id AS order_id,
o.tenant_id,
o.customer_id,
c.segment AS customer_segment,
o.product_category,
o.amount,
o.ordered_at,
date_trunc('day', o.ordered_at) AS order_day,
date_trunc('week', o.ordered_at) AS order_week,
date_trunc('month', o.ordered_at) AS order_month
FROM orders o
JOIN customers c ON c.id = o.customer_id
$$
);
-- Building block 2: Daily aggregates (built on order_facts)
SELECT pgtrickle.create_stream_table(
'daily_metrics',
$$
SELECT
tenant_id,
product_category,
customer_segment,
order_day,
SUM(amount) AS revenue,
COUNT(*) AS orders,
COUNT(DISTINCT customer_id) AS unique_customers
FROM order_facts
GROUP BY tenant_id, product_category, customer_segment, order_day
$$
);
-- Application view: monthly trends (built on daily_metrics)
SELECT pgtrickle.create_stream_table(
'monthly_trends',
$$
SELECT
tenant_id,
product_category,
date_trunc('month', order_day) AS month,
SUM(revenue) AS monthly_revenue,
SUM(orders) AS monthly_orders,
SUM(unique_customers) AS monthly_customers -- note: sums daily uniques, so a customer active on several days is counted once per day
FROM daily_metrics
GROUP BY tenant_id, product_category, date_trunc('month', order_day)
$$
);
The DAG is: orders → order_facts → daily_metrics → monthly_trends. Each layer adds aggregation. New application views can be built on any layer without touching the layers below. Need "weekly revenue by segment"? Build it on daily_metrics. Need "top customers by lifetime value"? Build it on order_facts.
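As a sketch, the first of those ("weekly revenue by segment" built on daily_metrics) could look like this; the stream table name is illustrative:
SELECT pgtrickle.create_stream_table(
    'weekly_segment_revenue',
    $$
    SELECT
        tenant_id,
        customer_segment,
        date_trunc('week', order_day) AS week,
        SUM(revenue) AS weekly_revenue,
        SUM(orders) AS weekly_orders
    FROM daily_metrics
    GROUP BY tenant_id, customer_segment, date_trunc('week', order_day)
    $$
);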
Pattern 6: Configuration-Driven Definitions
For platforms that offer self-service analytics (users define their own dashboards), store stream table definitions as configuration:
-- Stream table definition registry
CREATE TABLE stream_table_registry (
id serial PRIMARY KEY,
name text NOT NULL UNIQUE,
tenant_id text,
definition text NOT NULL, -- SQL query
refresh_mode text DEFAULT 'DEFERRED',
version integer DEFAULT 1,
created_at timestamptz DEFAULT now(),
is_active boolean DEFAULT true
);
-- Deploy a definition
CREATE OR REPLACE FUNCTION deploy_stream_definition(p_name text)
RETURNS void AS $$
DECLARE
v_def record;
BEGIN
SELECT * INTO v_def FROM stream_table_registry WHERE name = p_name AND is_active;
IF NOT FOUND THEN RAISE EXCEPTION 'Definition not found: %', p_name; END IF;
-- Drop and recreate so the deployed table always matches the stored definition
PERFORM pgtrickle.drop_stream_table(p_name);
PERFORM pgtrickle.create_stream_table(p_name, v_def.definition);
PERFORM pgtrickle.alter_stream_table(p_name, refresh_mode := v_def.refresh_mode);
END;
$$ LANGUAGE plpgsql;
This pattern enables infrastructure-as-code for stream tables. Definitions are stored in a registry, versioned, and deployed via function calls. CI/CD pipelines can manage stream table deployments the same way they manage schema migrations.
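A minimal end-to-end use of the registry is an INSERT followed by a deploy call (the name and definition are illustrative; deploy_stream_definition as written drops and recreates the stream table each time):
-- Register a definition
INSERT INTO stream_table_registry (name, tenant_id, definition, refresh_mode)
VALUES (
    'tenant_abc_daily_revenue',
    'tenant_abc',
    'SELECT date_trunc(''day'', ordered_at) AS day, SUM(amount) AS revenue
       FROM orders WHERE tenant_id = ''tenant_abc''
      GROUP BY date_trunc(''day'', ordered_at)',
    'DEFERRED'
);
-- Deploy it (drops and recreates the stream table from the stored definition)
SELECT deploy_stream_definition('tenant_abc_daily_revenue');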
Row-Level Security for Shared Tables
When multiple tenants share a single stream table, PostgreSQL's RLS provides the isolation:
-- Enable RLS on the shared stream table
ALTER TABLE revenue_by_tenant_product ENABLE ROW LEVEL SECURITY;
-- Policy: each role can only see their tenant's data
CREATE POLICY tenant_isolation ON revenue_by_tenant_product
FOR SELECT
USING (tenant_id = current_setting('app.current_tenant'));
Applications set app.current_tenant at connection time, and RLS transparently filters results. The stream table itself contains all tenants' data (efficient for incremental maintenance), but each tenant only sees their own rows.
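On the application side this is just a GUC set per connection or per transaction; a minimal sketch:
-- At connection start (or SET LOCAL inside a transaction when using a pooler)
SET app.current_tenant = 'tenant_abc';
-- RLS now filters every read transparently
SELECT product_category, day, revenue
FROM revenue_by_tenant_product
ORDER BY day DESC;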
Choosing the Right Pattern
| Scenario | Recommended pattern |
|---|---|
| 10–50 tenants, same analytics | Pattern 1 (single table + tenant column) |
| 50–500 tenants, same analytics | Pattern 1 + RLS |
| Tenants with different schemas | Pattern 2 (template functions) |
| Strict compliance isolation | Pattern 3 (schema-per-tenant) |
| Evolving analytics definitions | Pattern 4 (versioned) |
| Complex analytics stack | Pattern 5 (composable blocks) |
| Self-service analytics platform | Pattern 6 (config-driven) |
Most applications should start with Pattern 1. It's the simplest, most efficient, and handles the majority of multi-tenant use cases. Move to more complex patterns only when the requirements demand it.
Stream tables don't have to be one-off definitions. Build a library of composable, versioned, tenant-aware analytics that grow with your product — without multiplicative infrastructure costs.
← Back to Blog Index | Documentation
pg_trickle on CloudNativePG
Running Incremental View Maintenance in Production Kubernetes
Running a stateful PostgreSQL extension in Kubernetes requires more thought than deploying a stateless web service. The extension has background workers, shared memory segments, and per-database state. Upgrades must be coordinated with the schema migration process. High availability means the extension must survive primary failover.
This post covers running pg_trickle on CloudNativePG — the CNCF-sandbox Kubernetes operator for PostgreSQL — in a production setup. The focus is on the operational mechanics: getting the extension installed, keeping it healthy across restarts and failovers, and monitoring it from outside the database.
Prerequisites
- Kubernetes 1.27+
- CloudNativePG operator 1.24+ installed in the cluster
- A container image with both PostgreSQL 18 and pg_trickle installed
- The pg_trickle shared library preloaded
The Container Image
CloudNativePG manages the PostgreSQL binary and data directory lifecycle. You need to provide a container image that includes the extension.
A minimal Dockerfile:
FROM ghcr.io/cloudnative-pg/postgresql:18
# Install build dependencies
USER root
RUN apt-get update && apt-get install -y \
build-essential \
postgresql-server-dev-18 \
git \
curl \
pkg-config \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Rust (required to build pg_trickle)
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# Install pgrx
RUN cargo install cargo-pgrx --version 0.18.0 && \
cargo pgrx init --pg18 /usr/lib/postgresql/18/bin/pg_config
# Build and install pg_trickle
ARG PGTRICKLE_VERSION=0.36.0
RUN git clone --depth 1 --branch v${PGTRICKLE_VERSION} \
https://github.com/trickle-labs/pg-trickle.git /tmp/pg-trickle && \
cd /tmp/pg-trickle && \
cargo pgrx install --release --pg-config /usr/lib/postgresql/18/bin/pg_config && \
rm -rf /tmp/pg-trickle
USER 26
For production, pin to a specific digest rather than a tag. Build the image in CI and push to your internal registry.
The Cluster Manifest
A production-ready CNPG Cluster resource:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: pgtrickle-cluster
namespace: production
spec:
instances: 3
imageName: your-registry/postgresql-pgtrickle:18-0.36.0
postgresql:
parameters:
# Required: load pg_trickle's shared library
shared_preload_libraries: "pg_trickle"
# pg_trickle configuration
pg_trickle.enabled: "on"
pg_trickle.max_parallel_workers: "4"
pg_trickle.backpressure_enabled: "on"
pg_trickle.backpressure_max_lag_mb: "128"
pg_trickle.log_format: "json"
# Standard PostgreSQL tuning
shared_buffers: "2GB"
effective_cache_size: "6GB"
maintenance_work_mem: "512MB"
work_mem: "64MB"
max_connections: "200"
wal_level: "logical" # required for pg_trickle's WAL features
max_wal_senders: "10"
max_replication_slots: "10"
bootstrap:
initdb:
database: app
owner: app
postInitSQL:
# Install extension in the target database
- CREATE EXTENSION IF NOT EXISTS pg_trickle;
# Optionally install pgvector if using vector features
- CREATE EXTENSION IF NOT EXISTS vector;
storage:
size: 100Gi
storageClass: fast-ssd # Use NVMe-backed storage for pg_trickle workloads
resources:
requests:
memory: "8Gi"
cpu: "4"
limits:
memory: "16Gi"
cpu: "8"
# Monitoring integration
monitoring:
enablePodMonitor: true
customQueriesConfigMap:
- name: pgtrickle-metrics
key: queries.yaml
The critical parameters:
- shared_preload_libraries: "pg_trickle" — required for the background worker to start
- wal_level: "logical" — required for pg_trickle's WAL decoder
- postInitSQL — runs once during database creation, installs the extension
High Availability and Failover
CNPG runs one primary and N-1 standbys. When the primary fails, CNPG promotes a standby and updates the service endpoints. Your application reconnects to the new primary.
pg_trickle's background workers run only on the primary. On a standby:
- The extension code is present (the shared library loads fine)
- The pg_trickle catalog tables are replicated
- The background workers don't start (the standby is read-only)
On failover:
- CNPG promotes a standby to primary
- PostgreSQL starts up
- pg_trickle's background workers start as part of the shared_preload_libraries initialization
- The workers load the stream table catalog and begin processing any pending change buffer entries
The change buffers are regular PostgreSQL tables, replicated via streaming replication. Changes captured before the failover are in the buffers on the new primary and will be processed after startup.
Expected staleness on failover: Up to the maximum of:
- The change buffer accumulation during the failover window (typically 10–60 seconds)
- The stream table's configured schedule
For a schedule = '5 seconds' stream table, expect up to ~60 seconds of staleness after failover on a fast cluster.
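After a failover you can confirm recovery directly from SQL; this uses the same pgtrickle.stream_table_status() function the metrics ConfigMap below relies on:
SELECT name,
       EXTRACT(EPOCH FROM (now() - last_refresh_at)) AS staleness_seconds,
       pending_change_rows
FROM pgtrickle.stream_table_status()
ORDER BY staleness_seconds DESC;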
Configuration Management with ConfigMaps
Rather than hardcoding GUC values in the Cluster manifest, use a ConfigMap for pg_trickle settings and reference it:
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: pgtrickle-config
data:
pg_trickle.enabled: "on"
pg_trickle.max_parallel_workers: "4"
pg_trickle.backpressure_enabled: "on"
pg_trickle.backpressure_max_lag_mb: "128"
pg_trickle.default_schedule: "5 seconds"
pg_trickle.log_format: "json"
# In the Cluster spec
spec:
postgresql:
parameters:
shared_preload_libraries: "pg_trickle"
configMapRef:
name: pgtrickle-config
This lets you update GUC values via kubectl apply without modifying the Cluster resource, and enables configuration review in git via normal PR workflow.
Prometheus Metrics
pg_trickle exposes metrics via the PostgreSQL query interface. Export them with a custom queries ConfigMap for the CNPG PodMonitor:
# pgtrickle-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: pgtrickle-metrics
data:
queries.yaml: |
pg_trickle_stream_tables:
query: |
SELECT
name,
sla_tier,
EXTRACT(EPOCH FROM (NOW() - last_refresh_at))::float AS staleness_seconds,
rows_changed_last_cycle,
avg_refresh_ms / 1000.0 AS avg_refresh_seconds,
pending_change_rows
FROM pgtrickle.stream_table_status()
metrics:
- name:
usage: LABEL
description: Stream table name
- sla_tier:
usage: LABEL
description: SLA tier
- staleness_seconds:
usage: GAUGE
description: Seconds since last refresh
- rows_changed_last_cycle:
usage: GAUGE
description: Rows changed in last refresh cycle
- avg_refresh_seconds:
usage: GAUGE
description: Average refresh duration in seconds
- pending_change_rows:
usage: GAUGE
description: Pending rows in change buffer
pg_trickle_change_buffers:
query: |
SELECT
source_table,
pending_rows,
EXTRACT(EPOCH FROM (NOW() - oldest_change_at))::float AS backlog_age_seconds
FROM pgtrickle.change_buffer_status()
metrics:
- source_table:
usage: LABEL
- pending_rows:
usage: GAUGE
description: Pending rows in change buffer for this source
- backlog_age_seconds:
usage: GAUGE
description: Age of the oldest pending change in seconds
With this, Grafana can visualize stream table staleness per SLA tier, change buffer depth per source table, and refresh timing trends.
Alerting Rules
Essential Prometheus alerts:
groups:
- name: pgtrickle
rules:
- alert: StreamTableCriticalStaleness
expr: |
pg_trickle_stream_tables_staleness_seconds{sla_tier="critical"} > 30
for: 2m
labels:
severity: critical
annotations:
summary: "Critical stream table {{ $labels.name }} is {{ $value | humanizeDuration }} stale"
- alert: StreamTableStandardStaleness
expr: |
pg_trickle_stream_tables_staleness_seconds{sla_tier="standard"} > 300
for: 5m
labels:
severity: warning
annotations:
summary: "Standard stream table {{ $labels.name }} is {{ $value | humanizeDuration }} stale"
- alert: ChangeBufferBacklog
expr: |
pg_trickle_change_buffers_backlog_age_seconds > 600
for: 5m
labels:
severity: warning
annotations:
summary: "Change buffer for {{ $labels.source_table }} has {{ $value | humanizeDuration }} backlog"
The critical staleness alert fires before users notice. The backlog alert fires when the refresh workers can't keep up with the write load.
Upgrading pg_trickle
Upgrading pg_trickle follows the standard extension upgrade pattern, coordinated with CNPG's rolling update mechanism.
Step 1: Build the new image
Build a new container image with the upgraded extension version. Push to your registry.
Step 2: Update the Cluster manifest
spec:
imageName: your-registry/postgresql-pgtrickle:18-0.37.0 # new version
Step 3: CNPG performs a rolling update
CNPG rolls the new image out to the standby instances one at a time, then updates the primary, either by switchover or by in-place restart depending on the configured primaryUpdateMethod.
Step 4: Run the extension migration
After the cluster is running the new image, the extension schema may need to be upgraded:
ALTER EXTENSION pg_trickle UPDATE TO '0.37.0';
For CNPG, this is most cleanly done as a Kubernetes Job that runs after the rolling update completes, or through the same tooling you already use for application schema migrations.
Sizing Guidance
pg_trickle's resource consumption scales with:
- Number of stream tables
- Write volume on source tables (drives change buffer size and refresh frequency)
- Complexity of stream table queries (drives delta computation cost)
For a typical deployment with 10–20 stream tables and moderate write volume:
| Resource | Minimum | Recommended |
|---|---|---|
| CPU (cores) | 2 | 4–8 |
| Memory | 4GB | 8–16GB |
| Storage IOPS | 3,000 | 10,000+ (NVMe preferred) |
| max_parallel_workers | 2 | 4 |
| shared_buffers | 1GB | 25% of RAM |
| maintenance_work_mem | 256MB | 1–2GB (for REINDEX operations) |
The most important factor is storage latency. pg_trickle's refresh cycles are I/O-bound when processing large deltas or maintaining HNSW indexes. SSDs (NVMe preferred) make the difference between 20ms refresh cycles and 200ms ones.
Production Checklist
Before going live with pg_trickle on CNPG:
- shared_preload_libraries includes pg_trickle
- wal_level = logical is set
- Extension installed in the target database via postInitSQL
- Custom metrics ConfigMap deployed and referenced in the Cluster manifest
- Prometheus alerts configured (staleness, backlog)
- Grafana dashboard imported
- SLA tiers configured for all stream tables
- Failover tested: kubectl cnpg promote to simulate primary failure, verify stream tables resume within the expected window
- Upgrade path tested in staging: new image + ALTER EXTENSION UPDATE
- Backpressure enabled and threshold tuned for your write volume
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Making pg_trickle Work Through PgBouncer
Connection pooling, session state, and the gotchas nobody warns you about
PgBouncer is the standard connection pooler for PostgreSQL. It sits between your application and the database, multiplexing hundreds or thousands of application connections onto a smaller pool of database connections.
pg_trickle works with PgBouncer. But there are configuration requirements that, if you get wrong, produce confusing failures — stream tables that don't refresh, LISTEN/NOTIFY that doesn't fire, and CDC triggers that silently miss changes.
Here's what you need to know.
Pool Modes and pg_trickle
PgBouncer has three pool modes:
| Mode | How it works | pg_trickle compatibility |
|---|---|---|
| session | One PgBouncer connection = one database connection for the session lifetime | ✅ Full compatibility |
| transaction | PgBouncer returns the connection to the pool after each transaction | ⚠️ Works with caveats |
| statement | PgBouncer returns the connection after each statement | ❌ Not compatible |
Session mode
Session mode is fully compatible. Each application connection gets a dedicated database connection. Session-level state (prepared statements, temp tables, advisory locks, LISTEN) works normally.
If you can afford the connection overhead, session mode is the simplest option.
Transaction mode (the common case)
Transaction mode is what most production deployments use. It's also where the gotchas live.
What works:
- pgtrickle.create_stream_table() — runs in a single transaction, works fine.
- pgtrickle.alter_stream_table() — same, single transaction.
- pgtrickle.refresh_stream_table() — single transaction.
- pgtrickle.pgt_status() — single query, works fine.
- Reading from stream tables — normal SELECT queries, no issues.
- CDC triggers — fire inside the source transaction, work fine.
What doesn't work without configuration:
- LISTEN/NOTIFY — requires a persistent connection. PgBouncer in transaction mode recycles the connection after the transaction, dropping the LISTEN registration.
- Advisory locks (pg_advisory_lock) — used by the relay for HA leader election. The lock is released when the connection is returned to the pool, which may happen unexpectedly in transaction mode.
- The background worker — this connects directly to PostgreSQL, bypassing PgBouncer entirely. Not affected by pool mode.
Statement mode
Statement mode returns the connection to the pool after every statement. This breaks multi-statement transactions, which pg_trickle's internal operations require. Don't use statement mode with pg_trickle.
The Background Worker Bypass
pg_trickle's background worker — the process that runs the scheduler and executes refresh cycles — connects directly to PostgreSQL using the shared_preload_libraries mechanism. It doesn't go through PgBouncer.
This means:
- The scheduler works regardless of your PgBouncer configuration.
- Refresh cycles are not affected by pool mode.
- The worker uses its own dedicated connection, separate from the application pool.
The background worker's connection is configured via PostgreSQL's pg_trickle.database GUC, not via the PgBouncer connection string. Make sure this points to PostgreSQL directly:
-- postgresql.conf
pg_trickle.database = 'mydb' -- Direct connection, not through PgBouncer
LISTEN/NOTIFY Through PgBouncer
If your application uses pg_trickle's reactive subscriptions (LISTEN/NOTIFY for stream table changes), you need a persistent connection for the LISTEN registration.
Option 1: Dedicated non-pooled connection.
Most PgBouncer configurations allow specifying a pool that uses session mode:
# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb
mydb_listen = host=localhost port=5432 dbname=mydb pool_mode=session pool_size=5
Your application uses mydb (transaction mode) for normal queries and mydb_listen (session mode) for LISTEN connections.
Option 2: Use the outbox instead of LISTEN/NOTIFY.
If you don't want to manage a separate connection pool, skip LISTEN/NOTIFY and use the outbox + relay pattern. The relay maintains its own persistent connection to PostgreSQL (bypassing PgBouncer) and delivers notifications to your application via Kafka, NATS, or webhooks.
Advisory Locks and the Relay
The relay uses advisory locks for leader election. In transaction mode, PgBouncer might return the connection to the pool between transactions, releasing the advisory lock.
Solution: The relay should connect directly to PostgreSQL, not through PgBouncer. Configure the relay's postgres_url to point to the PostgreSQL port, not the PgBouncer port:
# relay.toml
[global]
# Direct connection to PostgreSQL (port 5432), not PgBouncer (port 6432)
postgres_url = "postgres://user:pass@localhost:5432/mydb"
Prepared Statements
PgBouncer in transaction mode does not support server-side prepared statements by default; since version 1.21 it can track protocol-level prepared statements via the max_prepared_statements setting.
pg_trickle's SQL functions don't use prepared statements internally — they use SPI (Server Programming Interface), which executes queries directly. So this setting doesn't affect pg_trickle's operation.
However, if your application uses SQL-level prepared statements to query stream tables (e.g., PREPARE get_orders AS SELECT * FROM order_summary WHERE region = $1), you need session mode; driver-managed protocol-level prepared statements work in transaction mode once max_prepared_statements is set.
Configuration Checklist
# pgbouncer.ini — recommended for pg_trickle
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
# Reset server session state between clients (DISCARD ALL also deallocates prepared statements)
server_reset_query = DISCARD ALL
-- postgresql.conf — background worker connects directly
shared_preload_libraries = 'pg_trickle'
pg_trickle.enabled = on
pg_trickle.database = 'mydb'
# relay.toml — relay connects directly, not through PgBouncer
[global]
postgres_url = "postgres://user:pass@localhost:5432/mydb"
Monitoring Through PgBouncer
All of pg_trickle's monitoring queries work through PgBouncer in transaction mode:
-- These all work through PgBouncer
SELECT * FROM pgtrickle.pgt_status();
SELECT * FROM pgtrickle.health_check();
SELECT * FROM pgtrickle.change_buffer_sizes();
SELECT * FROM pgtrickle.st_refresh_stats();
They're single-transaction, read-only queries with no session state requirements.
The Short Version
| Component | Through PgBouncer? | Notes |
|---|---|---|
| Application reads from stream tables | ✅ Yes | Normal SELECT queries |
| Application creates/alters/drops stream tables | ✅ Yes | Single-transaction DDL |
| Application LISTEN for notifications | ⚠️ Session mode only | Or use outbox + relay |
| Background worker (scheduler) | ❌ Direct to PostgreSQL | Automatic, no configuration |
| Relay | ❌ Direct to PostgreSQL | Configure postgres_url to skip PgBouncer |
| Monitoring queries | ✅ Yes | Transaction mode is fine |
pg_trickle works with PgBouncer in transaction mode for all common operations. The two things that need direct connections — the background worker and the relay — already bypass PgBouncer by design. The only thing you need to plan for is LISTEN/NOTIFY, and that has straightforward workarounds.
← Back to Blog Index | Documentation
The pgvector Tooling Landscape in 2026
What each tool actually does — and where pg_trickle fits in
If you've been building RAG applications on PostgreSQL, you've probably noticed the ecosystem has grown considerably in the last two years. There's pgai, pg_vectorize, Debezium with pgvector support, and a handful of managed-platform approaches. They all promise to "keep your embeddings in sync" or "automate your vector pipeline."
The problem is that these tools operate at different layers, solve different problems, and make different tradeoffs. Comparing them directly is like comparing a bread knife to a bread maker. They're related, but they're not doing the same job.
This is an honest look at what each tool actually does, where each one falls short, and where pg_trickle sits relative to all of them. No marketing. Just the actual mechanics.
First, a note about pgai
Let's start with the elephant in the room.
pgai was archived by Timescale on February 26, 2026.
If you've read any article about keeping embeddings fresh in PostgreSQL from the last year, pgai was probably the top recommendation. Timescale built it, marketed it heavily, and it accumulated 5,800 GitHub stars. The approach was elegant: declare a create_vectorizer() configuration, run a stateless Python worker process, and the system would keep your embeddings synchronized as data changes.
The archive doesn't mean pgai was bad — it means Timescale made a strategic pivot. The repository is now read-only. The Python library still exists and works. But the project is no longer actively developed as an open-source extension.
Why is this interesting? Because pgai's architecture revealed something about the fundamental problem with embedding pipelines. The vectorizer worker was always an external process. It used a queue inside PostgreSQL (an ai.work_queue table) to track what needed re-embedding. The worker polled that queue, called the external embedding API, and wrote the results back.
This architecture is correct for the problem it solves — handling unreliable external API calls in the background, with retries, rate limiting, and error isolation. But it's fundamentally a distributed system. You have two processes that need to stay in sync: the database and the worker. Schema changes, table drops, API failures, and worker restarts all require coordination.
The archive suggests this coordination cost was higher than expected, or that Timescale found a simpler architecture achieved the same goal with less operational surface area. Either way, it's a useful data point for anyone designing embedding infrastructure today.
pg_vectorize: What it does well
pg_vectorize (now maintained independently by Chuck Henderson, formerly of Tembo) is the most active open-source tool for incremental embedding updates in PostgreSQL. It's written in Rust, supports both a PostgreSQL extension mode and a standalone HTTP server mode, and uses pgmq (a message queue built on PostgreSQL) for asynchronous processing.
The basic flow looks like this:
# Start the embedding service alongside PostgreSQL
docker compose up -d
# Register a vectorization job
curl -X POST http://localhost:8080/api/v1/table \
-d '{
"job_name": "my_products",
"src_table": "products",
"src_columns": ["product_name", "description"],
"primary_key": "product_id",
"update_time_col": "updated_at",
"model": "sentence-transformers/all-MiniLM-L6-v2"
}'
# Search
curl "http://localhost:8080/api/v1/search?job_name=my_products&query=camping+gear&limit=5"
In extension mode, you use SQL functions (vectorize.table(), vectorize.search()) for the same operations. The extension relies on pgmq under the hood — source-table changes enqueue messages, the worker dequeues them, calls the embedding service, and writes back.
What pg_vectorize does well:
- Keeps a single table's embeddings synchronized with its source text column.
- Works with managed PostgreSQL (RDS, Cloud SQL) where you can't install arbitrary extensions — just run the HTTP server separately.
- Handles failures gracefully via the pgmq retry mechanism.
- Local model support (via vector-serve, a bundled embedding server) — no dependency on an external API if you run your own models.
- Actively maintained: the v0.26.x line was released this week.
What pg_vectorize doesn't do:
- Multi-table denormalization. It syncs source_column → embedding. It doesn't maintain a denormalized join of documents + tags + permissions + metadata as a searchable flat table.
- Aggregate vectors. There's no vector_avg concept — no way to maintain per-user or per-cluster centroids incrementally.
- ANN index management. It embeds rows; it does not know about or manage IVFFlat drift or HNSW tombstone accumulation.
- SQL expressiveness. The vectorization pipeline is: one table, one text column, one embedding model. The richer "define any query, maintain the result" pattern is out of scope.
pg_vectorize solves a narrow but real problem cleanly. If your use case is exactly "text column changed → re-embed → update vector column," it's a solid choice.
The DIY approach: Why most teams still roll their own
Despite both pgai and pg_vectorize existing, most production RAG systems use a homebrew pipeline. The pattern:
- A database trigger or application-level change tracking (e.g., an updated_at timestamp, a needs_reembedding boolean flag).
- A background job (Celery, Sidekiq, AWS Lambda on SQS, a simple while True: loop) that polls for rows needing re-embedding.
- A write-back to the database.
This pattern persists because it's flexible. You control the embedding logic. You can batch efficiently. You can handle weird edge cases (chunks, metadata injection, different models for different content types) without fighting an abstraction layer.
The cost is operational: you own the retry logic, the failure alerting, the backlog monitoring, the queue depth, and the correctness guarantees. When the worker falls behind or crashes silently, someone gets paged.
The fundamental limitation of all these approaches — DIY, pgai, and pg_vectorize alike — is that they answer a single question: "When the source text changes, what's the new embedding?"
They don't answer: "When any input to my search corpus changes — text, metadata, permissions, tags, related records — how do I propagate that change to the thing users are actually searching over?"
Debezium + pgvector: The enterprise approach
Debezium is a CDC (change data capture) tool from Red Hat that reads PostgreSQL's write-ahead log, converts row-level changes into structured events, and streams them to Kafka. Since 2024, Debezium has supported vector column types — changes to pgvector columns are serialized correctly and can flow through the Kafka ecosystem.
The typical architecture:
PostgreSQL (WAL) → Debezium → Kafka → Consumer (Python/Java) → Embedding API → Write back to PostgreSQL
Or for a search-specific variant:
PostgreSQL (WAL) → Debezium → Kafka → Consumer → Compute denormalized record → Write to search PG instance
What Debezium brings:
- Rock-solid, battle-tested CDC for PostgreSQL.
- Works at scale — it's what you use when you're running a multi-hundred-GB database with heavy write load.
- Flexible: the Kafka consumer can do anything — embedding, enrichment, routing to multiple sinks, exactly-once delivery.
- Supports the vector type natively now (useful for migrating vector data between systems).
What Debezium doesn't bring:
- Anything in-database. Debezium requires Kafka, a Kafka Connect deployment, and at minimum one consumer service. You're building and maintaining a distributed system.
- SQL-level reasoning. The consumer sees row-level deltas, not query-level semantics. If your denormalized search document depends on five tables, the consumer must join and reconcile those changes itself — which is basically writing a CDC-aware ETL pipeline from scratch.
- Incremental aggregation. Debezium streams changes. It doesn't compute vector means, maintain group aggregates, or understand what a "change to one source row" means for a derived result.
- Anything approaching "keep my ANN index fresh" — you'd need to build that on top.
Debezium is infrastructure, not application logic. It moves data reliably. It doesn't maintain derived state.
The Dedicated Vector Databases: Not What You Think
Pinecone, Weaviate, Qdrant, Milvus — these are purpose-built systems that offer excellent approximate nearest-neighbor search with rich filtering. They're also frequently compared to pgvector.
Worth saying directly: these aren't really pgvector tooling. They're alternative systems, not enhancements. If you're on pgvector, you've already decided to stay in PostgreSQL. The dedicated vector databases solve a different problem for a different choice.
But they're relevant here because they're the most common alternative when pgvector users hit production scale problems. The pattern is: start with pgvector, hit a friction point (staleness, index maintenance, multi-table search complexity), and investigate moving to a dedicated vector database.
What dedicated vector databases do well:
- Very fast ANN search with rich metadata filtering, optimized end-to-end.
- Managed services with automatic index maintenance.
- High-dimensional vectors (many support >10,000 dimensions natively).
- Real-time ingestion at high write volume.
What they don't do:
- Transactional consistency with your PostgreSQL source data. You have a synchronization problem between your source database and the vector store.
- SQL joins, GROUP BY, window functions, CTEs over your vector corpus — you're limited to metadata filters.
- Any notion of incremental view maintenance. You push rows in; they're indexed. There's no concept of "this row's value is derived from these five other tables."
- Free. The managed options are expensive at scale.
The migration from pgvector to Pinecone/Weaviate/etc. typically doesn't simplify the embedding pipeline — it just moves the synchronization problem from "database ↔ embedding column" to "database ↔ external service." You still need to know when things changed and push them over.
What All These Tools Have in Common
Step back from the specifics and a pattern emerges.
Every tool in this list — pgai (archived), pg_vectorize, DIY batch jobs, Debezium pipelines, dedicated vector databases — is solving the same narrow problem:
When source text or data changes, how do I generate or re-deliver the embedding to where it needs to be?
That's Layer 1: embedding generation and delivery.
What none of them solve is Layer 2: maintaining the derived structures that live downstream of those embeddings, and keeping them consistent with both their source embeddings and the rest of your data.
Where pg_trickle Fits
pg_trickle operates entirely at Layer 2, with some overlap into Layer 1, but in a fundamentally different way.
The core of pg_trickle is incremental view maintenance (IVM). You define a SQL query, and pg_trickle maintains the result of that query as a live table, updating it incrementally as inputs change. The query can be anything: joins, aggregates, window functions, CTEs, subqueries.
For embeddings specifically, this matters in three ways that none of the Layer 1 tools address.
The denormalization problem
Your embedding column lives in documents.embedding. Your users search for documents. But what they're actually searching over is:
"A document, from an active project, with its category name, its author's display name, and its ACL groups — searchable by the text of the document and the vector of that text."
That's four tables. The embedding is one attribute of one of them. What users search over is a denormalized flat record composed of all four.
pg_trickle maintains that flat record automatically:
SELECT pgtrickle.create_stream_table(
name => 'doc_search_corpus',
query => $$
SELECT
d.id,
d.body,
d.embedding, -- your embedding, wherever it lives
p.name AS project_name,
u.display_name AS author,
acl.allowed_groups,
array_agg(t.name) AS tags
FROM documents d
JOIN projects p ON p.id = d.project_id
JOIN users u ON u.id = d.author_id
JOIN doc_acl acl ON acl.doc_id = d.id
LEFT JOIN doc_tags t ON t.doc_id = d.id
WHERE d.published = true
GROUP BY d.id, p.name, u.display_name, acl.allowed_groups
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
When someone updates a tag, renames a project, changes a permission, or edits the document text — only the affected rows in doc_search_corpus are updated. The rest is untouched. The HNSW index on doc_search_corpus.embedding receives targeted insert/delete pairs, not a full rebuild.
pg_vectorize cannot do this. It doesn't have a join concept. pgai couldn't do this either. This is a fundamentally different operation: SQL-level derivation, not embedding-API orchestration.
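The application's read path is then a single SELECT over the flat corpus, combining the pgvector distance operator with the denormalized metadata; a sketch, assuming allowed_groups is a text array and $1 is the query embedding:
SELECT id, project_name, author, tags
FROM doc_search_corpus
WHERE 'analytics-team' = ANY(allowed_groups)   -- ACL filter from the denormalized row
ORDER BY embedding <=> $1                      -- pgvector cosine-distance operator
LIMIT 20;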
The aggregate vector problem
Recommendation systems, clustering pipelines, and personalization engines all maintain aggregate vectors — centroid-style representations computed over groups of individual embeddings.
-- The "taste" of user 42, averaged over every item they've liked
SELECT vector_avg(item.embedding)
FROM user_likes ul
JOIN items i ON i.id = ul.item_id
WHERE ul.user_id = 42;
If you precompute this per user and store it (the only production-viable approach), it goes stale the instant someone likes a new item. The classic solutions are "recompute nightly" or "trigger a background job on each like."
pg_trickle maintains this incrementally:
SELECT pgtrickle.create_stream_table(
name => 'user_taste',
query => $$
SELECT ul.user_id,
vector_avg(i.embedding) AS taste_vec,
COUNT(*) AS like_count
FROM user_likes ul
JOIN items i ON i.id = ul.item_id
GROUP BY ul.user_id
$$,
refresh_mode => 'DIFFERENTIAL'
);
When user 42 likes item 1701, the engine computes:
new_taste = (old_sum_vector + item_1701.embedding) / (old_count + 1)
Only user 42's row changes. The HNSW index on taste_vec receives one update. One million users, thousands of likes per second — each cycle touches only the affected rows.
The algebraic trick here (vector_avg as a running sum / count) is the same mathematics pg_trickle uses for AVG(price) or SUM(revenue). Extending it to vectors requires a rule-set addition, not an engine change. That's shipping in v0.37.
The index maintenance problem
IVFFlat recall drops as the distribution of your embeddings shifts. HNSW accumulates tombstones as you delete old documents. Neither pgvector nor any of the embedding tools handle this automatically.
pg_trickle tracks rows_changed_since_last_reindex as a first-class catalog metric. You set a policy:
SELECT pgtrickle.alter_stream_table(
'doc_search_corpus',
post_refresh_action => 'reindex_if_drift',
reindex_drift_threshold => 0.10
);
After each refresh, the scheduler checks whether 10% of rows have changed since the last REINDEX. If so, it queues a concurrent REINDEX in a lower-priority tier that runs without blocking your queries. This ships in v0.38.
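The queued operation is an ordinary PostgreSQL concurrent reindex; run by hand it would be the equivalent of this (index name illustrative):
REINDEX INDEX CONCURRENTLY doc_search_corpus_embedding_idx;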
No other tool in this space does this. It's an operational gap that has existed since pgvector launched, and the standard answer is still "rebuild on a schedule."
The Natural Architecture
The important thing to understand is that pg_trickle and the Layer 1 embedding tools are not competitors. They're designed for adjacent problems.
The natural stack looks like this:
Source Tables (documents, users, products, ...)
│
▼ [Layer 1: pgai (archived) / pg_vectorize / your app code]
│ "text changed → call embedding API → write vector column"
│
Documents now have an embedding column
│
▼ [Layer 2: pg_trickle]
│ "embedding (and metadata) changed → maintain derived structures"
│
Denormalized corpora (search_corpus, user_taste, product_clusters, ...)
│
▼ [pgvector: HNSW / IVFFlat index, ANN queries]
│
Application queries
Layer 1 solves: "How do I re-embed a row when its text changes?" Layer 2 solves: "How do I maintain everything downstream of that embedding?"
They're different questions with different answers.
If you're using pg_vectorize to keep documents.embedding fresh, pg_trickle then builds on top of that. When pg_vectorize writes a new embedding back to documents.embedding, pg_trickle's CDC trigger captures that change and propagates it to every stream table that depends on documents. The denormalized search corpus updates automatically. User taste vectors that depend on item embeddings update automatically. The index maintenance policy fires automatically.
The key principle: embeddings are derived data, just like any other derived data. The fact that computing them requires an external API call is a Layer 1 concern. Everything that happens after those embeddings exist — how they combine with other data, how they aggregate, how they're indexed, how freshness is monitored and enforced — is a Layer 2 concern, and that's where pg_trickle operates.
The Honest Comparison Matrix
Here's where each tool actually sits:
| Capability | DIY batch | pg_vectorize | pgai (archived) | Debezium | pg_trickle |
|---|---|---|---|---|---|
| Generate embeddings via API | Hand-rolled | ✅ | ✅ | ❌ | ❌ (not its job) |
| Async retry & rate-limit handling | Hand-rolled | ✅ | ✅ | ❌ | ❌ |
| Single-table embedding sync | ✅ (manual) | ✅ | ✅ | ✅ | ✅ (passthrough) |
| Multi-table denormalized corpus | Hand-rolled | ❌ | ❌ | Partial (consumer code) | ✅ (native) |
| Incremental vector_avg aggregates | ❌ | ❌ | ❌ | ❌ | ✅ (v0.37) |
| ANN index drift detection | ❌ | ❌ | ❌ | ❌ | ✅ (v0.38) |
| Full SQL expressiveness | ❌ | ❌ | ❌ | ❌ | ✅ |
| In-database, no external processes | ❌ | Partial | ❌ | ❌ | ✅ |
| ACID-correct derivation | ❌ | Partial | Partial | ❌ | ✅ |
| Reactive alerts on distance predicates | ❌ | ❌ | ❌ | ❌ | ✅ (v0.39) |
| Actively maintained | ✅ | ✅ | ❌ (archived) | ✅ | ✅ |
The Case For Simplifying Your Stack
The common outcome of adding Layer 1 tools, then adding Layer 2 orchestration, then adding Debezium for auditing, is a system with five moving parts that each require monitoring, versioning, and operational expertise.
The case for pg_trickle is not that it eliminates all infrastructure — you still need something to generate embeddings (an app, pgai, pg_vectorize, or pgml). The case is that it eliminates the ad hoc infrastructure: the hand-rolled denormalization sync, the custom batch job for centroid recomputation, the manual reindex schedule, the DIY staleness monitoring.
Those pieces live outside the database and outside the transaction boundary, which means they're brittle. They can fail silently, fall behind, or produce state inconsistent with the rest of your data. Pulling them into the database — where changes are captured transactionally, derivations are computed algebraically, and monitoring is a live view — makes them observable and correct by construction.
One embedding architecture that's held up well in production RAG systems:
- Application writes chunk_text to documents. If you're generating embeddings synchronously (fast models, low latency), write embedding in the same transaction. If you're using an async embedding service (slow models, high cost), use pg_vectorize or a lightweight background job to write the embedding asynchronously within a few seconds.
- pg_trickle maintains the search corpus, user taste vectors, product clusters, and any other derived structure as stream tables. These update automatically — within 5–10 seconds of the embedding arriving.
- pgvector provides the HNSW or IVFFlat indexes on the stream tables. pg_trickle's drift-aware reindex policy keeps them healthy.
- Application queries run against flat, fresh, indexed stream tables with clean metadata. No post-fetch filtering, no over-fetching.
The embedding generation is handled. The derived-state problem is handled. The index maintenance is handled. Everything is visible in one monitoring view.
What to Take Away
The pgvector tooling ecosystem in 2026 is in an interesting moment. pgai's archive is a signal that embedding pipelines as out-of-database processes carry real operational cost. pg_vectorize is the most pragmatic open-source answer to the embedding generation problem. Debezium is the enterprise answer to streaming data at scale. And pg_trickle is the answer to the problem that none of them address: derived state maintenance downstream of embeddings.
If you're at the stage of "I need to keep one embedding column in sync with one text column," pg_vectorize is the right tool. Start there.
If you're at the stage of "my search corpus is stale, my user taste vectors are rebuilt nightly, my IVFFlat index is drifting, and my denormalized search document is hand-maintained with triggers," that's the Layer 2 problem. That's pg_trickle's domain.
The two aren't alternatives. They're layers. And understanding which layer your problem lives in is the most important step in choosing the right tool.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source, documentation, and the pgvector integration roadmap are at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
PostGIS + pg_trickle: Incremental Geospatial Aggregates
Heatmaps, spatial clustering, and geo-joins that update in milliseconds as new points arrive
Geospatial analytics has a dirty secret: most of the expensive computations are aggregations that could be maintained incrementally, but nobody does it because the tooling doesn't exist. You have a table of GPS points that grows by millions of rows per day. You need a heatmap of activity density. You need to know how many delivery vehicles are in each zone. You need real-time counts of events within administrative boundaries.
The standard approach is to periodically rebuild the entire spatial aggregate — scan all points, perform spatial containment tests, group and count. For a table with 100 million GPS points and 500 geographic zones, that's 100 million ST_Contains calls. It takes minutes. Your heatmap is always stale.
pg_trickle changes the economics. When a new GPS point arrives, only the zone containing that point needs its count updated. The incremental cost is one ST_Contains call per new point — not 500 million.
The Spatial Join Problem
The fundamental operation in geospatial analytics is the spatial join: matching points (or polygons) to regions. In PostgreSQL with PostGIS:
SELECT
z.zone_name,
COUNT(*) AS event_count,
AVG(e.magnitude) AS avg_magnitude
FROM events e
JOIN zones z ON ST_Contains(z.geom, e.location)
GROUP BY z.zone_name;
This query joins every event to the zone that contains it, then aggregates per zone. With a GiST index on zones.geom, each point lookup is fast (milliseconds). But doing it for 100 million points is still slow in aggregate.
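That per-lookup speed assumes the usual PostGIS GiST indexes are in place; a minimal setup (index names illustrative):
-- Spatial indexes on both sides of the join
CREATE INDEX zones_geom_gix ON zones USING gist (geom);
CREATE INDEX events_location_gix ON events USING gist (location);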
As a stream table:
SELECT pgtrickle.create_stream_table(
'zone_event_summary',
$$
SELECT
z.zone_name,
COUNT(*) AS event_count,
AVG(e.magnitude) AS avg_magnitude
FROM events e
JOIN zones z ON ST_Contains(z.geom, e.location)
GROUP BY z.zone_name
$$
);
Now, when 1,000 new events arrive, the refresh performs 1,000 spatial lookups (one per new event), identifies which zones they fall in, and increments the counts for those zones. The other 99,999,000 existing events are not re-examined. The zones that received no new events are not touched.
Live Density Heatmaps
Heatmaps partition the world into grid cells and count observations per cell. The finer the grid, the more cells — and the more expensive a full recompute becomes.
-- Grid-based heatmap: divide the world into 100m×100m cells
SELECT pgtrickle.create_stream_table(
'activity_heatmap',
$$
SELECT
ST_SnapToGrid(location, 0.001) AS grid_cell, -- ~100m at mid-latitudes
COUNT(*) AS density,
MAX(recorded_at) AS last_activity
FROM gps_tracks
GROUP BY ST_SnapToGrid(location, 0.001)
$$
);
For a city-scale deployment tracking ride-share vehicles, this might produce 50,000 active grid cells. A full recompute scans all historical GPS points. An incremental refresh processes only the new GPS points since the last refresh and updates only the cells they fall into. If 10,000 new points arrive in a 5-second window, spread across 2,000 cells, the refresh touches 2,000 out of 50,000 cells. The other 48,000 are untouched.
This is particularly valuable for live dashboards. A fleet management screen showing vehicle density doesn't need to wait 30 seconds for a full spatial aggregation. With incremental maintenance, the heatmap updates in under 100 milliseconds after each batch of new GPS points.
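The dashboard read side is a plain query over the maintained cells; a sketch, assuming the points are stored in SRID 4326 and using an illustrative bounding box:
SELECT grid_cell, density, last_activity
FROM activity_heatmap
WHERE grid_cell && ST_MakeEnvelope(-122.52, 37.70, -122.35, 37.83, 4326)  -- viewport bounds
ORDER BY density DESC
LIMIT 100;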
Geofencing at Scale
Geofencing — detecting when entities enter or leave defined regions — is traditionally implemented as a stream processing problem. You set up Kafka, write a Flink job that maintains state per entity, and detect boundary crossings. It's powerful but operationally complex.
With pg_trickle, geofencing is just a spatial join maintained incrementally:
-- Track which vehicles are currently in which zones
SELECT pgtrickle.create_stream_table(
'vehicle_zone_assignment',
$$
SELECT
v.vehicle_id,
v.last_location,
z.zone_id,
z.zone_name,
z.zone_type
FROM vehicles v
JOIN zones z ON ST_Contains(z.geom, v.last_location)
$$
);
When a vehicle's location is updated, the stream table recomputes which zone it's now in. If the zone changed (the vehicle crossed a boundary), the old row is removed and the new row is inserted. Downstream consumers — alert tables, notification triggers, audit logs — can react to the change.
For counting vehicles per zone:
SELECT pgtrickle.create_stream_table(
'zone_vehicle_counts',
$$
SELECT
zone_id,
zone_name,
COUNT(*) AS vehicle_count
FROM vehicle_zone_assignment
GROUP BY zone_id, zone_name
$$
);
This cascading stream table updates when vehicles move between zones. If 100 vehicles update their positions but stay in the same zones, the count table doesn't change. If 5 vehicles cross zone boundaries, only those 5 transitions propagate. The cost scales with boundary crossings, not with position updates.
Distance-Based Aggregation
Another common pattern is aggregating data within a radius of reference points — "how many events occurred within 1km of each store location?"
SELECT pgtrickle.create_stream_table(
'store_nearby_events',
$$
SELECT
s.store_id,
s.store_name,
COUNT(*) AS nearby_event_count,
AVG(e.severity) AS avg_severity
FROM stores s
JOIN incidents e ON ST_DWithin(s.location::geography, e.location::geography, 1000)
GROUP BY s.store_id, s.store_name
$$
);
Each new incident is checked against store locations within 1km. With a spatial index, this is a fast lookup. The incremental cost is one spatial index probe per new incident, regardless of how many historical incidents exist. For a chain with 500 stores monitoring incidents in their vicinities, the refresh for 50 new incidents requires 50 × (index probe into 500 stores) = a few milliseconds.
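For the ST_DWithin probes to stay index-assisted on both sides of the join, each table needs a GiST index on the geography cast; a sketch (index names illustrative):
CREATE INDEX stores_location_geog_gix ON stores USING gist ((location::geography));
CREATE INDEX incidents_location_geog_gix ON incidents USING gist ((location::geography));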
Polygon-on-Polygon Overlaps
Not all geospatial analytics involve points. Land use analysis, flood zone mapping, and zoning compliance require polygon overlap computations:
SELECT pgtrickle.create_stream_table(
'parcel_flood_exposure',
$$
SELECT
p.parcel_id,
p.owner,
fz.flood_zone_class,
ST_Area(ST_Intersection(p.geom, fz.geom)) / ST_Area(p.geom) AS pct_in_flood_zone
FROM parcels p
JOIN flood_zones fz ON ST_Intersects(p.geom, fz.geom)
$$
);
When flood zone boundaries are redrawn (updated polygons in the flood_zones table), only parcels that intersect with the changed boundaries need recomputation. If a flood zone update affects 200 out of 50,000 parcels, the incremental refresh processes 200 intersection calculations — not 50,000.
Performance Characteristics
The performance advantage of incremental spatial aggregation depends on two factors:
- Spatial locality of changes — if new points cluster in a small number of zones, fewer aggregate groups are updated
- Index efficiency — PostGIS GiST indexes make point-in-polygon lookups O(log n), so the per-delta cost is low
Benchmark results for a fleet tracking scenario (10,000 vehicles, 500 zones, position updates every 10 seconds):
| Operation | Full recompute | Incremental refresh | Speedup |
|---|---|---|---|
| Vehicle counts per zone | 4.2s | 8ms | 525× |
| Heatmap (10k cells) | 12.7s | 22ms | 577× |
| Geofence violations | 6.1s | 5ms | 1,220× |
The geofence violation detection is the most dramatic because most position updates don't cross boundaries — the incremental engine correctly identifies that no change is needed for the vast majority of updates and skips them entirely.
Combining With Temporal Windows
Geospatial analytics often need temporal context — "activity in the last hour" or "events today." Combine date_trunc with spatial aggregation for time-windowed spatial analytics:
SELECT pgtrickle.create_stream_table(
'hourly_zone_activity',
$$
SELECT
z.zone_name,
date_trunc('hour', e.created_at) AS hour,
COUNT(*) AS event_count,
ST_Centroid(ST_Collect(e.location)) AS activity_centroid
FROM events e
JOIN zones z ON ST_Contains(z.geom, e.location)
WHERE e.created_at > now() - interval '24 hours'
GROUP BY z.zone_name, date_trunc('hour', e.created_at)
$$
);
The incremental engine processes new events by zone and hour bucket, and handles the sliding window by removing events that age out of the 24-hour window. Each refresh only touches the buckets that received new data or had data expire.
Your PostGIS data is already spatial. pg_trickle makes your spatial analytics incremental. The combination gives you live geospatial dashboards at a fraction of the traditional cost.
← Back to Blog Index | Documentation
Reactive Alerts Without Polling
PostgreSQL as a Push System — No Polling Loop Required
Your fraud detection system checks suspicious transactions every minute. Your inventory management system pings a "low stock" endpoint every 30 seconds. Your SLA monitoring service queries open support tickets every 15 seconds and sends a Slack message when something crosses a threshold.
All three of these are polling loops. They're querying the same data over and over, waiting for a condition to become true, then doing something.
Polling loops work. They're also wasteful by design — they spend 99% of their CPU on queries that return "nothing changed yet." They add latency equal to half the polling interval on average. They create a thundering herd problem when many services poll the same data. And they require someone to own the poll interval: too slow and you miss SLAs, too fast and you hammer the database.
There's a better model. It's called reactive subscriptions, and pg_trickle ships it in v0.39.
What Reactive Subscriptions Mean in Practice
The idea is simple: instead of asking "has this condition become true?" repeatedly, you subscribe to the condition and receive a notification when it fires.
-- Create a stream table for SLA monitoring
SELECT pgtrickle.create_stream_table(
name => 'ticket_sla',
query => $$
SELECT
t.id AS ticket_id,
t.team_id,
t.status,
t.priority,
t.created_at,
t.sla_deadline,
t.sla_deadline < NOW() AS breached,
t.sla_deadline - NOW() AS time_to_breach
FROM tickets t
WHERE t.status != 'resolved'
$$,
schedule => '30 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- Subscribe to the condition: tickets that just breached SLA
SELECT pgtrickle.subscribe(
stream_table => 'ticket_sla',
condition => 'breached = true AND OLD.breached = false',
channel => 'sla_breach_alerts',
payload => '{"ticket_id": ticket_id, "team_id": team_id, "priority": priority}'
);
When a ticket crosses its SLA deadline — which pg_trickle detects during the next refresh cycle — a notification fires on sla_breach_alerts. Your application, connected via PostgreSQL LISTEN, receives it.
No polling. No "check every 30 seconds." The notification fires once, at the right time, with exactly the data you need.
The Mechanics
pg_trickle's reactive subscriptions work at the stream table layer, not the raw table layer.
When the DVM engine applies a delta to a stream table, it evaluates each active subscription condition against every changed row. The condition can reference both OLD.* and NEW.* — the row values before and after the delta — enabling you to express transitions rather than just states.
This is the key capability that raw PostgreSQL triggers lack. A trigger fires on every change. A subscription fires only when the condition transitions from false to true.
State Transitions vs. Continuous Conditions
A continuous condition query:
WHERE breached = true
This fires on every refresh for every breached ticket — every 30 seconds until the ticket is resolved. That's not what you want for an alert.
A state transition:
WHERE breached = true AND OLD.breached = false
This fires exactly once per ticket, when it crosses from non-breached to breached. Subsequent refreshes see OLD.breached = true and don't re-fire.
pg_trickle maintains the OLD.* state implicitly — the previous values come from the stream table's current contents before the delta is applied.
Real-World Use Cases
Inventory Alerts
SELECT pgtrickle.create_stream_table(
name => 'inventory_status',
query => $$
SELECT
p.id AS product_id,
p.name,
p.sku,
i.qty,
i.reorder_point,
i.qty <= i.reorder_point AS below_reorder
FROM products p
JOIN inventory i ON i.product_id = p.id
$$,
schedule => '1 minute',
refresh_mode => 'DIFFERENTIAL'
);
-- Alert when a product drops below reorder point
SELECT pgtrickle.subscribe(
stream_table => 'inventory_status',
condition => 'below_reorder = true AND OLD.below_reorder = false',
channel => 'inventory_alerts',
payload => '{"product_id": product_id, "sku": sku, "qty": qty, "reorder_point": reorder_point}'
);
-- Alert when a product hits zero
SELECT pgtrickle.subscribe(
stream_table => 'inventory_status',
condition => 'qty = 0 AND OLD.qty > 0',
channel => 'stockout_alerts',
payload => '{"product_id": product_id, "sku": sku, "product_name": name}'
);
Both alerts fire exactly once per event, not once per polling cycle.
Fraud Detection
SELECT pgtrickle.create_stream_table(
name => 'customer_velocity',
query => $$
SELECT
customer_id,
COUNT(*) AS tx_count_1h,
SUM(amount) AS tx_volume_1h,
COUNT(DISTINCT merchant_id) AS distinct_merchants_1h,
COUNT(DISTINCT SPLIT_PART(ip_address, '.', 1)
|| '.' || SPLIT_PART(ip_address, '.', 2)) AS distinct_ip_class_b
FROM transactions
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY customer_id
$$,
schedule => '10 seconds',
refresh_mode => 'DIFFERENTIAL'
);
-- Flag when a customer exceeds velocity thresholds
SELECT pgtrickle.subscribe(
stream_table => 'customer_velocity',
condition => $$(
(tx_count_1h > 20 AND OLD.tx_count_1h <= 20)
OR (tx_volume_1h > 5000 AND OLD.tx_volume_1h <= 5000)
OR (distinct_merchants_1h > 10 AND OLD.distinct_merchants_1h <= 10)
)$$,
channel => 'fraud_review_queue',
payload => '{"customer_id": customer_id, "tx_count": tx_count_1h, "volume": tx_volume_1h}'
);
The velocity aggregates are maintained incrementally by the DVM engine. The subscription fires when any threshold is crossed. The fraud review service subscribes to fraud_review_queue via LISTEN and processes each notification.
No Kafka. No Redis Streams. No lambda function polling DynamoDB. Just PostgreSQL.
Vector Distance Alerts
This is where it gets interesting for ML workloads.
SELECT pgtrickle.create_stream_table(
name => 'user_drift',
query => $$
SELECT
u.id AS user_id,
vector_avg(i.embedding) AS current_taste_vec,
ut.baseline_taste_vec,
vector_avg(i.embedding) <=> ut.baseline_taste_vec AS taste_drift
FROM user_likes ul
JOIN items i ON i.id = ul.item_id
JOIN user_taste_baselines ut ON ut.user_id = ul.user_id
JOIN users u ON u.id = ul.user_id
WHERE ul.liked_at > NOW() - INTERVAL '7 days'
GROUP BY u.id, ut.baseline_taste_vec
$$,
schedule => '5 minutes',
refresh_mode => 'DIFFERENTIAL'
);
-- Notify when a user's taste has drifted significantly from their baseline
SELECT pgtrickle.subscribe(
stream_table => 'user_drift',
condition => 'taste_drift > 0.3 AND (OLD.taste_drift IS NULL OR OLD.taste_drift <= 0.3)',
channel => 'taste_drift_events',
payload => '{"user_id": user_id, "drift": taste_drift}'
);
When a user's recent listening/watching/browsing history has shifted enough from their baseline profile, a notification fires. Your recommendation service receives it and triggers a profile refresh. No polling, no CASE WHEN taste_drift > 0.3 check in a cron job.
Connecting Your Application
On the application side, you use PostgreSQL LISTEN:
import json
import select

import psycopg2

conn = psycopg2.connect(dsn)
conn.autocommit = True  # autocommit so NOTIFY payloads are delivered as they arrive
cur = conn.cursor()
cur.execute("LISTEN sla_breach_alerts")

while True:
    # Block on the socket instead of busy-looping; wake up when a NOTIFY arrives
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timeout with no notification; keep waiting
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        payload = json.loads(notify.payload)
        handle_sla_breach(
            ticket_id=payload['ticket_id'],
            team_id=payload['team_id'],
            priority=payload['priority'],
        )
The LISTEN connection is cheap — a long-lived idle connection that wakes up only when there's a notification. This is the fundamental difference from polling: the database pushes to you rather than you pulling from it.
For high-volume workloads where a single LISTEN connection can't process fast enough, notifications can be routed to a PostgreSQL-backed queue (pg_trickle integrates with pgmq) and consumed in parallel.
Why Not PostgreSQL Triggers?
You can implement a version of this with raw PostgreSQL triggers today. The trigger fires on INSERT/UPDATE, checks the condition, and calls pg_notify() if it's true.
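For comparison, here is a sketch of a hand-rolled version of the stockout alert from earlier, written as a trigger on the raw inventory table. This is not pg_trickle code; it works only because the condition is a simple per-row comparison, and the limitations below cover everything it can't do.
CREATE OR REPLACE FUNCTION notify_stockout() RETURNS trigger AS $func$
BEGIN
  -- Per-row transition check: qty just hit zero
  IF NEW.qty = 0 AND OLD.qty > 0 THEN
    PERFORM pg_notify(
      'stockout_alerts',
      json_build_object('product_id', NEW.product_id, 'qty', NEW.qty)::text
    );
  END IF;
  RETURN NEW;
END;
$func$ LANGUAGE plpgsql;

CREATE TRIGGER inventory_stockout
AFTER UPDATE OF qty ON inventory
FOR EACH ROW EXECUTE FUNCTION notify_stockout();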
The limitations:
- Triggers fire on raw table changes, not on derived state. If your alert condition depends on an aggregate (total volume in the last hour, count of open tickets, user taste drift), you can't express it as a trigger on a raw table.
- Transitions are hard. A row-level trigger gets OLD.* for the raw row, but when the condition is over derived state (an aggregate, a join result), expressing "fire only when it transitions from false to true" means querying the previous derived value: an extra SELECT inside the trigger, adding latency to every write on the source table.
- Complexity. A fraud detection trigger that checks velocity aggregates inside itself is a trigger that does a GROUP BY on every transaction insert. The performance implications are significant.
Reactive subscriptions in pg_trickle are different because the condition evaluates against the stream table — derived state — not the raw source. The aggregate computation happens in the background worker. The subscription check is a comparison on already-computed values.
The Latency Story
With a 10-second refresh interval, you'll see a maximum alert latency of 10 seconds. The average is 5 seconds.
Is that too slow? For most alert use cases — SLA monitoring, inventory management, business KPI dashboards — no. For true real-time applications (payment fraud that must be blocked in-flight, stock trading algorithms), a millisecond-latency streaming system like Kafka Streams or Apache Flink is the right tool.
pg_trickle's reactive subscriptions cover the 90% of alert use cases that don't need sub-second latency but have been paying the cost of polling anyway. If your team currently runs polling loops at 30-second or 1-minute intervals and wishes they were faster, reactive subscriptions are the right answer.
What This Changes Architecturally
The polling loop paradigm produces a certain kind of system:
- Each service owns its own polling schedule
- The same data is read repeatedly by many services
- Latency is bounded by the poll interval, not by when data changes
- The alert fires at most once per interval, even if conditions have been true for longer
The reactive subscription paradigm produces a different kind of system:
- The database tracks when conditions become true
- Services receive notifications when they're relevant
- Latency is bounded by the refresh interval plus processing time
- The alert fires exactly when the condition first becomes true, not on a schedule
The second system is more correct, more efficient, and more composable. The transition is mechanical: replace each polling loop with a pgtrickle.subscribe() call and a LISTEN connection.
The polling loop was always a workaround for the lack of a good subscription primitive. Now there is one.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Real-Time Leaderboards That Don't Lie
Maintaining ranked lists incrementally — no full recomputation, no stale scores
Every gaming platform, sales dashboard, and coding challenge needs a leaderboard. The requirements sound simple: show the top N items, ordered by score, updated in real time.
The implementation is where teams get stuck.
The naive approach — run SELECT ... ORDER BY score DESC LIMIT 100 on every page load — works until the table has a few million rows and the leaderboard page takes 3 seconds to render.
The standard fix — cache the result in a materialized view and refresh it every few minutes — works until users notice their score updated 5 minutes ago and they're still not on the board.
The correct approach is to maintain the leaderboard incrementally: when a score changes, update only the affected positions. pg_trickle does this automatically.
A Basic Leaderboard
CREATE TABLE player_scores (
player_id bigint PRIMARY KEY,
username text NOT NULL,
score bigint NOT NULL DEFAULT 0,
updated_at timestamptz NOT NULL DEFAULT now()
);
-- Top 100 leaderboard, updated every second
SELECT pgtrickle.create_stream_table(
'leaderboard_top_100',
$$SELECT player_id, username, score
FROM player_scores
ORDER BY score DESC
LIMIT 100$$,
schedule => '1s',
refresh_mode => 'DIFFERENTIAL'
);
leaderboard_top_100 is a real table with exactly 100 rows. Reading it is a sequential scan of 100 rows — sub-millisecond.
When a player's score changes, pg_trickle's differential engine determines whether the change affects the top 100. If the player was already in the top 100 and their score increased, one row is updated. If a new player breaks into the top 100, one row is inserted and the player at position 100 is evicted.
The cost is proportional to the number of changes that affect the leaderboard, not the total number of players. A game with 10 million players and 50 score updates per second refreshes the leaderboard in under a millisecond per cycle.
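One way to read it, with ranks attached at query time; the window function runs over only 100 rows, so it adds effectively nothing:
-- Ranks computed at read time over the fixed-size leaderboard table
SELECT RANK() OVER (ORDER BY score DESC) AS rank,
       player_id, username, score
FROM leaderboard_top_100
ORDER BY score DESC;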
Tied Scores
Ties are the first thing that breaks naive leaderboard implementations. If players 47 through 53 all have a score of 8,500, what's the ranking?
The answer depends on your tiebreaker. pg_trickle respects whatever ORDER BY you specify:
-- Tiebreaker: earliest to reach the score wins
SELECT pgtrickle.create_stream_table(
'leaderboard_top_100',
$$SELECT player_id, username, score, updated_at
FROM player_scores
ORDER BY score DESC, updated_at ASC
LIMIT 100$$,
schedule => '1s',
refresh_mode => 'DIFFERENTIAL'
);
With updated_at ASC as the tiebreaker, the player who reached the score first ranks higher. This is deterministic and fair.
Multi-Category Leaderboards
Games often have multiple leaderboard categories: overall score, weekly score, per-game-mode score.
CREATE TABLE match_results (
id bigserial PRIMARY KEY,
player_id bigint NOT NULL,
game_mode text NOT NULL,
score int NOT NULL,
played_at timestamptz NOT NULL DEFAULT now()
);
-- Overall top 50
SELECT pgtrickle.create_stream_table(
'lb_overall_top50',
$$SELECT player_id, SUM(score) AS total_score
FROM match_results
GROUP BY player_id
ORDER BY total_score DESC
LIMIT 50$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
-- Per-mode top 20
SELECT pgtrickle.create_stream_table(
'lb_per_mode_top20',
$$SELECT game_mode, player_id, SUM(score) AS mode_score
FROM match_results
GROUP BY game_mode, player_id
ORDER BY mode_score DESC
LIMIT 20$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
-- Weekly top 100 (temporal: only matches from the current week)
SELECT pgtrickle.create_stream_table(
'lb_weekly_top100',
$$SELECT player_id, SUM(score) AS weekly_score
FROM match_results
WHERE played_at >= date_trunc('week', now())
GROUP BY player_id
ORDER BY weekly_score DESC
LIMIT 100$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
Each leaderboard is independently maintained. The weekly leaderboard uses a time-window filter — pg_trickle handles the window eviction automatically as time passes.
Sales Dashboards
Leaderboards aren't just for games. Sales teams live and die by rankings:
CREATE TABLE deals (
id bigserial PRIMARY KEY,
sales_rep_id bigint NOT NULL,
amount numeric(12,2) NOT NULL,
stage text NOT NULL,
closed_at timestamptz
);
-- Q2 leaderboard: closed deals this quarter
SELECT pgtrickle.create_stream_table(
'sales_leaderboard_q2',
$$SELECT
s.id AS rep_id,
s.name AS rep_name,
t.name AS team_name,
SUM(d.amount) AS closed_revenue,
COUNT(*) AS deal_count
FROM deals d
JOIN sales_reps s ON s.id = d.sales_rep_id
JOIN teams t ON t.id = s.team_id
WHERE d.stage = 'closed_won'
AND d.closed_at >= '2026-04-01'
AND d.closed_at < '2026-07-01'
GROUP BY s.id, s.name, t.name
ORDER BY closed_revenue DESC
LIMIT 25$$,
schedule => '3s', refresh_mode => 'DIFFERENTIAL'
);
When a deal closes, the rep's ranking updates within 3 seconds. The sales manager sees it on the wall TV. No batch job, no data pipeline, no "wait until the ETL runs at midnight."
The Pagination Problem
Top-100 is clean. But what about the user who's ranked #4,372 and wants to see their neighborhood — ranks 4,370 to 4,380?
You have two options.
Option 1: Multiple stream tables with offsets. Create a stream table for each "page" of the leaderboard. This works for fixed segments (top 100, ranks 101–200, etc.) but doesn't scale to arbitrary pagination.
Option 2: A full ranking stream table. Skip the LIMIT and maintain the full ranked list:
SELECT pgtrickle.create_stream_table(
'player_rankings',
$$SELECT m.player_id, p.username, SUM(m.score) AS total_score
FROM match_results m
JOIN player_scores p ON p.player_id = m.player_id
GROUP BY m.player_id, p.username$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
-- Then query with pagination at read time: the rank is computed against the
-- whole table, while the rows returned are filtered to a narrow score range
-- around the target player.
SELECT ps.*,
1 + (SELECT COUNT(*) FROM player_rankings p2
     WHERE p2.total_score > ps.total_score) AS rank
FROM player_rankings ps
WHERE ps.total_score BETWEEN
(SELECT total_score FROM player_rankings WHERE player_id = $1) - 100
AND
(SELECT total_score FROM player_rankings WHERE player_id = $1) + 100
ORDER BY ps.total_score DESC;
The stream table maintains the pre-aggregated scores. The ranking and pagination happen at query time — which is fast because the stream table has far fewer rows than the raw events table, and you're filtering to a narrow score range around the target player.
Performance
Numbers from a synthetic benchmark:
| Scenario | Players | Score updates/s | Refresh cycle | Leaderboard read |
|---|---|---|---|---|
| Top-100, simple | 1M | 100 | ~0.3ms | ~0.1ms |
| Top-100, simple | 10M | 1,000 | ~0.8ms | ~0.1ms |
| Top-50 per category, 20 categories | 1M | 500 | ~1.2ms | ~0.1ms |
The read cost is constant — you're reading a fixed-size table. The refresh cost scales with the number of score changes that affect the leaderboard boundary, not the total number of players or updates.
Why Not Redis?
Redis sorted sets are the standard answer for leaderboards. They work. They're fast. But:
- Dual-write problem. Your score data lives in PostgreSQL (because that's where your application logic, transactions, and constraints are). Keeping Redis in sync requires a pipeline — and that pipeline can fail, lag, or lose data.
- No complex queries. Redis sorted sets support ZRANGEBYSCORE and ZRANK. They don't support JOINs, GROUP BY, time-window filters, or multi-table aggregation. Your "closed deals this quarter by team" leaderboard can't be a Redis sorted set.
- One more system. Redis adds operational overhead: monitoring, failover, persistence configuration, memory management.
With pg_trickle, the leaderboard is a PostgreSQL table. It's backed by the same WAL, the same backup, the same monitoring. The read performance is comparable to Redis for small result sets (sub-millisecond for 100 rows), and the consistency guarantee is stronger.
If your only leaderboard is a simple score ranking with no JOINs, Redis is fine. For anything more complex, pg_trickle keeps it in one place.
← Back to Blog Index | Documentation
Recursive CTEs That Update Themselves
Incremental view maintenance for graph queries, bill-of-materials, and org charts
Most IVM systems bail out when they see WITH RECURSIVE. The query involves an unknown number of iterations, the result set can grow or shrink unpredictably based on a single edge change, and the naive approach — recompute from scratch — defeats the purpose of incremental maintenance.
pg_trickle doesn't bail out. It maintains recursive CTEs incrementally using two strategies, chosen automatically based on the workload: semi-naive evaluation for insert-only tables, and Delete-and-Rederive for tables with mixed inserts, updates, and deletes.
This post explains how both work, when each applies, and what the practical limits are.
The Recursive CTE Recap
A recursive CTE has two parts: a base case and a recursive step.
WITH RECURSIVE reachable AS (
-- Base case: direct reports
SELECT employee_id, manager_id, 1 AS depth
FROM org_chart
WHERE manager_id = 42
UNION ALL
-- Recursive step: transitive closure
SELECT o.employee_id, o.manager_id, r.depth + 1
FROM org_chart o
JOIN reachable r ON o.manager_id = r.employee_id
WHERE r.depth < 10
)
SELECT * FROM reachable;
PostgreSQL evaluates this by repeatedly executing the recursive step until no new rows are produced. For a deep org chart, this might iterate 8–10 times.
The problem for IVM: when someone changes managers (UPDATE org_chart SET manager_id = 7 WHERE employee_id = 99), the entire reachability set can change. An employee and all their reports might move from one subtree to another. The delta isn't just one row — it's an arbitrarily large cascade.
Strategy 1: Semi-Naive Evaluation (Insert-Only)
When the source table only receives INSERTs (no updates or deletes), the incremental strategy is straightforward.
New rows inserted into org_chart can only add to the reachable set. They can't remove anything. So pg_trickle:
- Takes the newly inserted rows as the "frontier."
- Runs the recursive step using only the frontier as input.
- Any new rows produced become the next frontier.
- Repeats until the frontier is empty.
- Inserts all discovered rows into the stream table.
This is semi-naive evaluation — the classic technique from Datalog. Each iteration only processes rows discovered in the previous iteration, not the full accumulated result.
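A rough sketch of one frontier expansion in plain SQL, for illustration only. The delta_org_chart table is hypothetical and stands in for this cycle's captured inserts; pg_trickle generates its internal plan rather than running hand-written statements like these.
-- Iteration 0: the frontier is just the newly inserted edges.
CREATE TEMP TABLE frontier AS
SELECT employee_id, manager_id, 1 AS depth
FROM delta_org_chart;                 -- hypothetical: this cycle's inserts

-- Each iteration joins ONLY the frontier against the edge table,
-- producing the next frontier; the accumulated result is never rescanned.
CREATE TEMP TABLE next_frontier AS
SELECT o.employee_id, o.manager_id, f.depth + 1 AS depth
FROM org_chart o
JOIN frontier f ON o.manager_id = f.employee_id
WHERE f.depth < 15;

-- Rows from each frontier are appended to the stream table; the loop stops
-- when a frontier comes back empty.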
-- You create the stream table
SELECT pgtrickle.create_stream_table(
name => 'reachable_from_ceo',
query => $$
WITH RECURSIVE reachable AS (
SELECT employee_id, manager_id, 1 AS depth
FROM org_chart WHERE manager_id = 1
UNION ALL
SELECT o.employee_id, o.manager_id, r.depth + 1
FROM org_chart o
JOIN reachable r ON o.manager_id = r.employee_id
WHERE r.depth < 15
)
SELECT * FROM reachable
$$,
schedule => '5s'
);
If 3 new employees are added, pg_trickle doesn't re-traverse the entire org chart. It starts from those 3 rows, walks down their subtrees, and inserts only the newly reachable rows. For an org chart with 50,000 employees, adding 3 people touches maybe 3–15 rows instead of 50,000.
Limitation: pg_trickle uses the append-only fast path here. If you enable append_only => true and the source table later receives DELETEs or UPDATEs, pg_trickle will detect the violation and fall back to FULL refresh for that cycle.
Strategy 2: Delete-and-Rederive (Mixed DML)
When the source table has updates and deletes, the problem is harder. Removing an edge from a graph can make previously reachable nodes unreachable — but only if there's no alternative path.
Consider: employee 99 reports to manager 42. Manager 42 reports to the CEO. If you delete the edge 99 → 42, is 99 still reachable from the CEO? Only if 99 has another path (maybe through a dotted-line reporting relationship).
The Delete-and-Rederive algorithm handles this:
- Delete phase: Identify all rows in the stream table that might be affected by the source changes. This is conservative — it may over-delete.
- Rederive phase: Starting from the base case, re-derive the affected portion of the recursive result. Rows that are still reachable get re-inserted.
- Net delta: The difference between what was deleted and what was re-derived is the actual change to apply.
This is more expensive than semi-naive evaluation because it touches more rows. But it's still cheaper than a full recomputation when the change is localized. Deleting one edge in a graph with 100,000 nodes might affect a subtree of 200 nodes. Delete-and-Rederive processes those 200 nodes, not all 100,000.
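Here is a sketch of what the phases look like in SQL for the org-chart example, assuming the edge 99 → 42 was just deleted. This is illustrative pseudocode of the algorithm's shape, not the statements pg_trickle actually emits.
-- Phase 1 (delete): conservatively remove everything that was reachable
-- through the deleted edge, i.e. employee 99 and their whole subtree.
WITH RECURSIVE affected(employee_id) AS (
  VALUES (99)
  UNION ALL
  SELECT o.employee_id
  FROM org_chart o JOIN affected a ON o.manager_id = a.employee_id
)
DELETE FROM reachable_from_ceo r
USING affected a
WHERE r.employee_id = a.employee_id;

-- Phase 2 (rederive): re-run the recursion from the base case, restricted
-- to the affected employees, and re-insert the ones still reachable
-- (for example via a dotted-line edge).

-- Phase 3 (net delta): the difference between what was deleted and what
-- was re-derived is the actual change applied to the stream table.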
Practical Example: Bill of Materials
A manufacturing BOM is a classic recursive structure. Parts contain sub-parts, which contain sub-sub-parts.
-- The source table
CREATE TABLE bom (
parent_part_id INT REFERENCES parts(id),
child_part_id INT REFERENCES parts(id),
quantity INT NOT NULL
);
-- The stream table: exploded BOM with cumulative quantities
SELECT pgtrickle.create_stream_table(
name => 'exploded_bom',
query => $$
WITH RECURSIVE exploded AS (
SELECT parent_part_id, child_part_id, quantity, 1 AS level
FROM bom
WHERE parent_part_id IN (SELECT id FROM parts WHERE is_top_level)
UNION ALL
SELECT e.parent_part_id, b.child_part_id,
e.quantity * b.quantity, e.level + 1
FROM exploded e
JOIN bom b ON b.parent_part_id = e.child_part_id
WHERE e.level < 20
)
SELECT parent_part_id,
child_part_id,
SUM(quantity) AS total_quantity,
MAX(level) AS max_depth
FROM exploded
GROUP BY parent_part_id, child_part_id
$$,
schedule => '10s'
);
When a supplier changes and you update a sub-component relationship, only the affected branch of the BOM tree is re-derived. The rest — potentially thousands of part relationships — stays untouched.
The Depth Guard
Recursive CTEs can diverge. A cycle in the graph (A → B → C → A) causes infinite recursion. PostgreSQL handles this with a WHERE depth < N guard or a CYCLE clause.
pg_trickle requires one of these guards. If it detects a recursive CTE without a termination condition, it rejects the query at creation time:
ERROR: recursive CTE 'reachable' has no termination guard
HINT: Add a WHERE depth < N or CYCLE clause to prevent infinite recursion
This is a deliberate design choice. An unbounded recursive CTE in a stream table could consume unbounded resources on every refresh cycle.
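The depth bound is the guard used throughout this post; the CYCLE clause (PostgreSQL 14+) is the other form the check accepts. A sketch of the reachability query written with CYCLE instead of a depth column:
WITH RECURSIVE reachable AS (
  SELECT employee_id, manager_id FROM org_chart WHERE manager_id = 1
  UNION ALL
  SELECT o.employee_id, o.manager_id
  FROM org_chart o
  JOIN reachable r ON o.manager_id = r.employee_id
) CYCLE employee_id SET is_cycle USING path   -- stop expanding once a node repeats
SELECT employee_id, manager_id FROM reachable WHERE NOT is_cycle;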
When to Use FULL Instead
Recursive CTEs with DIFFERENTIAL mode work well when:
- Changes are localized (a few edges added/removed per cycle)
- The graph is deep but sparse
- The recursion depth is bounded
They work poorly when:
- A single change can cascade to most of the result (e.g., changing the root node)
- The graph is dense and fully connected
- The result set is small enough that full recomputation is fast anyway
For the last case, pg_trickle's AUTO mode handles this automatically. If the cost model predicts that Delete-and-Rederive will touch more than 10% of the result, it falls back to FULL for that cycle.
Graph Reachability, Transitive Closure, Shortest Paths
The recursive CTE support covers several common graph patterns:
Transitive closure (who can reach whom):
WITH RECURSIVE closure AS (
SELECT src, dst FROM edges
UNION
SELECT c.src, e.dst
FROM closure c JOIN edges e ON c.dst = e.src
)
SELECT * FROM closure;
Shortest path (with depth tracking):
WITH RECURSIVE paths AS (
SELECT src, dst, 1 AS hops
FROM edges WHERE src = 'A'
UNION ALL
SELECT p.src, e.dst, p.hops + 1
FROM paths p JOIN edges e ON p.dst = e.src
WHERE p.hops < 10
)
SELECT dst, MIN(hops) AS shortest
FROM paths GROUP BY dst;
Ancestor queries (all ancestors of a node):
WITH RECURSIVE ancestors AS (
SELECT id, parent_id FROM categories WHERE id = 42
UNION ALL
SELECT c.id, c.parent_id
FROM categories c JOIN ancestors a ON c.id = a.parent_id
)
SELECT * FROM ancestors;
All three patterns work as stream tables with incremental maintenance. The depth guard applies to all of them.
Performance Characteristics
For a realistic benchmark — an org chart with 50,000 employees, 8 levels deep:
| Scenario | FULL refresh | DIFFERENTIAL refresh |
|---|---|---|
| 1 new hire (leaf) | 45ms | 2ms |
| 5 transfers (mid-level) | 45ms | 12ms |
| Reorg: move entire department (500 people) | 45ms | 38ms |
| Change CEO (affects everyone) | 45ms | 52ms |
The crossover point — where DIFFERENTIAL becomes more expensive than FULL — is around 15–20% of the tree being affected. pg_trickle's AUTO mode detects this and switches strategies.
Summary
Recursive CTEs are the SQL feature most people assume can't be maintained incrementally. pg_trickle does it with two algorithms:
- Semi-naive evaluation for insert-only workloads — processes only new frontier rows.
- Delete-and-Rederive for mixed DML — conservatively deletes affected rows, then re-derives them.
Both require a depth guard. Both are automatic — you write the recursive CTE, pg_trickle picks the strategy. And when the change is large enough that incremental maintenance isn't worth it, AUTO mode falls back to FULL.
If your data has a graph structure — org charts, BOMs, category trees, network topologies — and you're currently recomputing the closure on a timer, this is the post that should make you reconsider.
← Back to Blog Index | Documentation
The Relay Deep Dive: NATS, Redis Streams, and RabbitMQ
Beyond Kafka: five broker backends for pgtrickle-relay
The Kafka blog post covered the most common relay use case: streaming pg_trickle deltas to Kafka. But pgtrickle-relay supports six backends, and the non-Kafka ones are often a better fit depending on your infrastructure.
This post covers the other five: NATS JetStream, Redis Streams, RabbitMQ (AMQP), AWS SQS, and HTTP webhooks. Each has different semantics, different performance characteristics, and different operational profiles.
The Relay Architecture (Quick Recap)
pgtrickle-relay is a standalone binary that acts as a bridge:
pg_trickle outbox → relay binary → external broker
external broker → relay binary → pg_trickle inbox
It runs as a sidecar — not inside PostgreSQL, not inside the broker. One relay binary can manage multiple pipelines, each with its own source (outbox) and sink (broker).
Configuration is TOML:
[source]
connection = "postgres://user:pass@localhost/mydb"
[[pipelines]]
stream_table = "order_summary"
sink = "nats" # or redis, rabbitmq, sqs, webhook
NATS JetStream
NATS is a lightweight messaging system. JetStream adds persistence, replay, and consumer groups on top of NATS's fire-and-forget core.
When to Use NATS
- You want low-latency pub/sub with durable delivery.
- Your infrastructure is Kubernetes-native (NATS runs well as a StatefulSet).
- You need subject-based routing with wildcards.
- You're already running NATS for service-to-service messaging.
Configuration
[[pipelines]]
stream_table = "order_events"
sink = "nats"
[pipelines.nats]
url = "nats://nats-server:4222"
stream = "ORDERS"
subject = "orders.{op}.{outbox_id}"
Subject templates: The {op} placeholder expands to INSERT, UPDATE, or DELETE. The {outbox_id} is the monotonic sequence number. You can also use {stream_table} and {refresh_id}.
A subscriber can listen to:
- orders.> — all order events
- orders.INSERT.> — only inserts
- orders.DELETE.> — only deletes
JetStream Consumer Groups
NATS JetStream supports consumer groups natively. Multiple relay consumers (for HA) can share a durable consumer group:
[pipelines.nats]
url = "nats://nats-server:4222"
stream = "ORDERS"
subject = "orders.>"
consumer_group = "relay-primary"
If the primary relay fails, the secondary picks up from the last acknowledged message.
Performance
NATS is fast. Expect:
- Publish latency: ~0.5ms per message (local NATS server)
- Throughput: 50,000+ messages/second sustained
- Replay: Full stream replay from any sequence number
Redis Streams
Redis Streams (XADD/XREAD) provide an append-only log similar to Kafka, but with Redis's operational simplicity.
When to Use Redis Streams
- You're already running Redis.
- You need a simple, low-ops message queue.
- Consumers are Redis-native (most languages have good Redis client libraries).
- You don't need cross-datacenter replication (Redis Cluster handles this, but it's more complex).
Configuration
[[pipelines]]
stream_table = "inventory_changes"
sink = "redis"
[pipelines.redis]
url = "redis://redis-server:6379"
stream_key = "pgtrickle:inventory"
maxlen = 100000 # optional: cap stream length
MAXLEN: Redis Streams can grow unbounded. Set maxlen to automatically trim old entries. With approximate trimming (~ prefix), Redis keeps roughly this many entries.
Consumer Groups
Redis Streams have native consumer groups (XGROUP):
[pipelines.redis]
url = "redis://redis-server:6379"
stream_key = "pgtrickle:inventory"
consumer_group = "indexer"
Multiple consumers in the group share the workload. Each message is delivered to exactly one consumer in the group.
Performance
- XADD latency: ~0.2ms per message
- Throughput: 100,000+ messages/second (single Redis instance)
- Memory usage: ~100 bytes per stream entry overhead
Caveat: Redis is memory-bound. If your stream tables produce large deltas (wide rows, many changes per cycle), the Redis memory footprint grows quickly. Monitor with XLEN and set maxlen to prevent OOM.
RabbitMQ (AMQP)
RabbitMQ uses exchanges and queues with routing keys. It's the most flexible in terms of message routing — fanout, direct, topic, and headers exchanges.
When to Use RabbitMQ
- You need complex routing (one message to multiple queues based on content).
- Your organization standardizes on AMQP.
- You need message-level TTL and dead-letter queues.
- You want per-message acknowledgments with redelivery on failure.
Configuration
[[pipelines]]
stream_table = "user_events"
sink = "rabbitmq"
[pipelines.rabbitmq]
url = "amqp://user:pass@rabbitmq-server:5672/vhost"
exchange = "pgtrickle.events"
exchange_type = "topic"
routing_key = "users.{op}"
Routing keys: With a topic exchange, consumers bind queues to patterns:
- users.* — all user events
- users.INSERT — only inserts
- *.DELETE — deletes from any stream table
Fanout Example
[pipelines.rabbitmq]
exchange = "pgtrickle.broadcast"
exchange_type = "fanout"
Every consumer queue bound to the exchange gets every message. Useful for broadcasting changes to multiple downstream services simultaneously.
Performance
- Publish latency: ~1–2ms per message
- Throughput: 10,000–30,000 messages/second (depends on persistence settings)
- Persistence: Durable exchanges and queues survive broker restart
Note: RabbitMQ's throughput is lower than NATS or Redis because it provides stronger delivery guarantees (persistent messages with acknowledgments). The trade-off is reliability vs. speed.
AWS SQS
SQS is AWS's managed message queue. No infrastructure to manage — it's a service.
When to Use SQS
- Your workloads run on AWS.
- You want zero message queue operations (no servers, no patching, no monitoring).
- Consumers are Lambda functions or ECS tasks.
- You need FIFO ordering within a message group.
Configuration
[[pipelines]]
stream_table = "order_summary"
sink = "sqs"
[pipelines.sqs]
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789/pgtrickle-orders"
region = "us-east-1"
# Credentials from environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
# or IAM role
FIFO queues: For ordered delivery:
[pipelines.sqs]
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789/pgtrickle-orders.fifo"
message_group_id = "{stream_table}"
message_dedup_id = "{outbox_id}"
The message_dedup_id ensures exactly-once delivery within SQS's 5-minute deduplication window.
Performance
- Publish latency: 5–20ms per message (network round-trip to SQS API)
- Throughput: ~3,000 messages/second (standard queue), ~300/second (FIFO)
- Cost: ~$0.40 per million messages
Batching: The relay batches up to 10 messages per SQS SendMessageBatch call, reducing API calls and latency.
HTTP Webhooks
The simplest sink: POST each message to an HTTP endpoint.
When to Use Webhooks
- You're integrating with a third-party service that accepts webhooks.
- You don't have a message broker and don't want one.
- The consumer is a serverless function (AWS Lambda, Cloudflare Workers).
- Volume is low enough that per-message HTTP calls are acceptable.
Configuration
[[pipelines]]
stream_table = "alert_triggers"
sink = "webhook"
[pipelines.webhook]
url = "https://api.example.com/webhooks/pgtrickle"
headers = { "Authorization" = "Bearer ${ENV:WEBHOOK_TOKEN}", "Content-Type" = "application/json" }
timeout_ms = 5000
retry_max = 3
retry_backoff_ms = 1000
Headers: Support environment variable interpolation (${ENV:VAR_NAME}) for secrets.
Retry: Failed POSTs are retried with exponential backoff: 1s, 2s, 4s. After retry_max attempts, the message is logged as failed and the relay continues.
Performance
- Latency: depends on the webhook endpoint (typically 50–200ms per request)
- Throughput: limited by endpoint response time and concurrency
- Reliability: best-effort (retries, but no persistent queue)
Caveat: Webhooks are inherently at-least-once. The endpoint must be idempotent. Use the outbox_id in the payload for deduplication.
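One simple way to get that idempotency on the receiving side, sketched as SQL the webhook handler could run against its own database. The processed_messages table is hypothetical, not part of pg_trickle or the relay.
-- Track which outbox_ids have already been handled.
CREATE TABLE IF NOT EXISTS processed_messages (
  outbox_id    bigint PRIMARY KEY,
  processed_at timestamptz NOT NULL DEFAULT now()
);

-- The handler runs this first; if no row comes back, the message is a
-- duplicate delivery and the rest of the work is skipped.
INSERT INTO processed_messages (outbox_id)
VALUES ($1)
ON CONFLICT (outbox_id) DO NOTHING
RETURNING outbox_id;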
Choosing a Backend
| Backend | Latency | Throughput | Ops Overhead | Best For |
|---|---|---|---|---|
| Kafka | 2–10ms | 100K+/s | High (ZK/KRaft, brokers) | Enterprise event streaming |
| NATS | 0.5ms | 50K+/s | Low (single binary) | Kubernetes-native, low-latency |
| Redis | 0.2ms | 100K+/s | Low (existing Redis) | Simple queues, existing Redis |
| RabbitMQ | 1–2ms | 10–30K/s | Medium | Complex routing, AMQP |
| SQS | 5–20ms | 3K/s | Zero | AWS-native, serverless |
| Webhook | 50–200ms | 10–100/s | Zero | Third-party integrations |
Decision tree:
- Already running Kafka? → Kafka.
- On AWS with no broker preference? → SQS.
- Want low-latency and run Kubernetes? → NATS.
- Already running Redis? → Redis Streams.
- Need complex routing? → RabbitMQ.
- Low volume, third-party integration? → Webhook.
Multi-Sink Pipelines
A single relay instance can run multiple pipelines to different backends:
[[pipelines]]
stream_table = "order_events"
sink = "kafka"
[pipelines.kafka]
brokers = "kafka:9092"
topic = "orders"
[[pipelines]]
stream_table = "order_events"
sink = "webhook"
[pipelines.webhook]
url = "https://slack-webhook.example.com/notify"
The same stream table's outbox feeds both Kafka and a Slack webhook. Each pipeline has its own consumer group and offset — they're independent.
Summary
pgtrickle-relay supports six backends: Kafka, NATS, Redis Streams, RabbitMQ, SQS, and HTTP webhooks. Each has different trade-offs in latency, throughput, operational complexity, and delivery semantics.
The relay binary is the same regardless of backend. Switch backends by changing the TOML config. Run multiple pipelines to different backends simultaneously. Use environment variables for secrets.
Pick the backend that matches your existing infrastructure. If you have no preference, NATS (for self-hosted) or SQS (for AWS) are the lowest-friction options.
← Back to Blog Index | Documentation
How We Replaced a Celery Pipeline with 3 SQL Statements
A before/after story about async Python workers and differential view maintenance
This is a story about a pipeline we built, how it grew, and what we eventually replaced it with.
The names are generic. The failure modes are real. If you're running async Python workers to maintain derived data in PostgreSQL, some of this will be familiar.
The Original Problem
We had an e-commerce platform. PostgreSQL for the operational data. Elasticsearch for the search index. An event-driven architecture for "everything important should trigger something."
After 18 months, we had a serious denormalization problem. Our product search index needed data from seven tables: products, categories, brands, suppliers, tags, inventory, and price_history. The Elasticsearch document was a denormalized flat record of all of them. Keeping it in sync required knowing when any of the seven tables changed and what to reindex.
We built a Celery pipeline to handle it.
The Pipeline (Version 1)
PostgreSQL row change
→ row-level trigger writes to outbox table
→ Celery Beat polls outbox every 30 seconds
→ Celery worker resolves the change to a product_id
→ Celery worker fetches full denormalized record from PostgreSQL
→ Celery worker indexes into Elasticsearch
→ outbox entry marked processed
At launch, this worked fine. Products updated within 60 seconds. Acceptable.
Stats at launch:
- Pipeline components: 3 (PostgreSQL trigger, Celery Beat, Celery worker)
- Lines of code in pipeline: ~400
- Deployment steps: 2 (app deploy, worker deploy)
- Average latency, source change → indexed: 35 seconds
- Throughput: easily handled our 50–100 product changes/day
Version 2: Scale
We expanded to 200,000 products. Daily product data sync from supplier APIs: 15,000 updates per night. The 30-second polling interval, combined with a worker pool of 4, created a backlog that took 4 hours to drain.
Solutions attempted:
- Increase worker concurrency to 16. Memory on the worker instances ballooned. We hit PostgreSQL connection limits because each worker opened its own connection pool.
- Switch from polling to Redis Pub/Sub notifications. Reduced latency to 5 seconds on average. Added Redis as a dependency. Added a Redis sentinel deployment for HA.
- Add a priority field to the outbox — high-traffic items process first. Added a manager task to compute priorities. Added a monitoring dashboard for queue depth by priority.
Stats at Version 2:
- Pipeline components: 5 (trigger, Beat, worker, Redis, priority manager)
- Lines of code in pipeline: ~1,200
- Deployment steps: 4
- Average latency: 5 seconds
- P99 latency: 180 seconds (during supplier sync)
- Incidents per month: ~2
Version 3: Correctness
During a supplier sync, we discovered that the order of Celery task execution didn't match the order of changes. Product 12345 was updated twice in 3 seconds by the supplier sync. Two tasks were enqueued. The tasks ran out of order. Elasticsearch ended up with the older version of the product.
The fix was to add a version field to products and the outbox, and have the worker skip indexing if the task's version was lower than the current version. This required:
- A version column on products (auto-incremented on every UPDATE)
- The worker to re-fetch the product and check versions before indexing
- A migration for all existing products
We also discovered that the outbox table had grown to 8 million rows because the cleanup job had silently failed two months earlier. The cleanup job now ran on a schedule and sent an alert when the backlog exceeded 100k rows.
Stats at Version 3:
- Pipeline components: 6 (trigger, Beat, worker, Redis, priority manager, cleanup job)
- Lines of code in pipeline: ~1,800
- Deployment steps: 5
- Average latency: 5 seconds
- P99 latency: 180 seconds
- On-call runbook length: 6 pages
- Incidents per month: ~1.5
What We Actually Needed
At this point we had an honest conversation about what the pipeline was doing.
The Elasticsearch index was serving product search. The product search was our primary user-facing feature. Latency above 5 seconds was causing user complaints (products added to a campaign weren't appearing in searches fast enough).
The Elasticsearch requirement was actually an assumption, not a hard constraint. We had chosen Elasticsearch initially because we needed full-text search with fast faceting and we'd assumed PostgreSQL couldn't do it. By version 3, we were running PostgreSQL 18 with much better full-text support, pgvector for semantic search, and partial indexes for faceting.
We decided to run the experiment: what would it take to replace the Elasticsearch index with a denormalized PostgreSQL table and keep it fresh with IVM?
The Replacement: 3 SQL Statements
Statement 1: Create the stream table
SELECT pgtrickle.create_stream_table(
name => 'product_search',
query => $$
SELECT
p.id AS product_id,
p.name AS product_name,
p.description,
p.sku,
b.name AS brand_name,
b.id AS brand_id,
c.name AS category_name,
c.id AS category_id,
c.path AS category_path,
s.name AS supplier_name,
s.country AS supplier_country,
i.qty AS stock_qty,
i.qty > 0 AS in_stock,
ph.current_price,
ph.original_price,
ROUND(
(ph.original_price - ph.current_price) / ph.original_price * 100, 1
) AS discount_pct,
array_agg(t.name ORDER BY t.name)
AS tags,
p.embedding AS search_vec,
p.created_at,
p.updated_at
FROM products p
JOIN brands b ON b.id = p.brand_id
JOIN categories c ON c.id = p.category_id
JOIN suppliers s ON s.id = p.supplier_id
LEFT JOIN inventory i ON i.product_id = p.id
LEFT JOIN current_prices ph ON ph.product_id = p.id
LEFT JOIN product_tags pt ON pt.product_id = p.id
LEFT JOIN tags t ON t.id = pt.tag_id
WHERE p.active = true
GROUP BY p.id, b.name, b.id, c.name, c.id, c.path,
s.name, s.country, i.qty, ph.current_price,
ph.original_price, p.embedding, p.created_at, p.updated_at
$$,
schedule => '3 seconds',
refresh_mode => 'DIFFERENTIAL'
);
Statement 2: Create the full-text search index
-- GIN index for full-text search over product name + description + tags
CREATE INDEX product_search_fts_idx ON product_search
USING GIN (
to_tsvector('english',
product_name || ' ' || COALESCE(description, '') || ' ' ||
brand_name || ' ' || category_name || ' ' ||
COALESCE(array_to_string(tags, ' '), ''))
);
Statement 3: Create the vector search index
-- HNSW index for semantic search
CREATE INDEX product_search_vec_idx ON product_search
USING hnsw (search_vec vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
That's it. No Redis. No Celery. No outbox table. No cleanup jobs. No version fields.
How the Search Query Changed
Before:
# Elasticsearch query
results = es.search(
index="products",
query={
"bool": {
"must": [{"multi_match": {"query": q, "fields": ["product_name^3", "description", "brand_name"]}}],
"filter": [
{"term": {"in_stock": True}},
{"range": {"discount_pct": {"gte": min_discount}}}
]
}
},
size=20
)
product_ids = [r['_id'] for r in results['hits']['hits']]
# Then a second round-trip to PostgreSQL to get fresh data
products = db.query("SELECT * FROM products WHERE id = ANY(%s)", [product_ids])
After:
-- Hybrid full-text + semantic search, all in PostgreSQL
WITH semantic AS (
SELECT product_id, search_vec <=> $query_vec AS distance
FROM product_search
WHERE in_stock = true
AND discount_pct >= $min_discount
ORDER BY search_vec <=> $query_vec
LIMIT 50
),
fulltext AS (
SELECT product_id,
ts_rank_cd(
to_tsvector('english', product_name || ' ' || COALESCE(description,'') || ' ' || brand_name),
plainto_tsquery('english', $query_text)
) AS rank
FROM product_search
WHERE in_stock = true
AND to_tsvector('english', product_name || ' ' || COALESCE(description,'') || ' ' || brand_name)
@@ plainto_tsquery('english', $query_text)
),
rrf AS (
SELECT COALESCE(s.product_id, f.product_id) AS product_id,
COALESCE(1.0 / (60 + ROW_NUMBER() OVER (ORDER BY s.distance)), 0) +
COALESCE(1.0 / (60 + ROW_NUMBER() OVER (ORDER BY f.rank DESC)), 0) AS rrf_score
FROM semantic s
FULL OUTER JOIN fulltext f ON f.product_id = s.product_id
)
SELECT ps.*
FROM rrf
JOIN product_search ps ON ps.product_id = rrf.product_id
ORDER BY rrf_score DESC
LIMIT 20;
The data returned is always fresh — the product_search stream table is at most 3 seconds stale. There's no second round-trip to get "fresh data" because product_search is the fresh data.
The Numbers
| Metric | Celery+ES pipeline | pg_trickle+PostgreSQL |
|---|---|---|
| P50 search latency | 12ms | 8ms |
| P99 search latency | 85ms | 22ms |
| Data freshness (average) | 5s | 1.5s |
| Data freshness (P99) | 180s (during sync) | 3s |
| Deployment components | 6 | 1 (the PostgreSQL extension) |
| Monthly incidents | 1.5 | 0 |
| Engineering time per quarter maintaining the pipeline | ~3 weeks | ~0 |
The P99 improvement from 180s to 3s was the most impactful change for users. The supplier sync no longer caused a visible degradation window.
The search latency improvement was a bonus — the PostgreSQL query planner was better at pruning product_search with partial indexes than Elasticsearch was with filter clauses, and the elimination of the network round-trip and deserialization overhead from Elasticsearch helped.
What We Gave Up
Elasticsearch has capabilities that PostgreSQL+pgvector doesn't fully match:
- Approximate nearest neighbor at very large scale (hundreds of millions of vectors): HNSW in pgvector is competitive up to ~50M rows with appropriate hardware. Above that, dedicated vector databases have more tuning options.
- Built-in relevance tuning: Elasticsearch has a mature BM25 + learning-to-rank stack. We replicated most of what we needed with RRF, but a dedicated ML ranking system is more flexible.
- Cross-cluster search and federation: Elasticsearch's distributed search across multiple clusters is mature. PostgreSQL's distribution story requires Citus or similar.
- Elasticsearch-specific query DSL: Some power-user queries we'd built on top of Elasticsearch's query language required rewriting in SQL.
For our scale (500k products, 10M queries/month), none of these were blockers. For a much larger deployment, the calculus might be different.
The Maintenance Story
The Celery pipeline had a 6-page oncall runbook. The pg_trickle replacement has:
-- Check freshness
SELECT name, last_refresh_at, staleness_secs
FROM pgtrickle.stream_table_status()
WHERE name = 'product_search';
-- Check queue depth (how many pending changes)
SELECT source_table, pending_rows
FROM pgtrickle.change_buffer_status();
-- Force a refresh if needed
SELECT pgtrickle.refresh('product_search');
That's the runbook. It fits in a Slack message.
The three SQL statements that created this setup were the end of a 3-year journey through increasingly complex async pipeline infrastructure. The right abstraction was in the database the entire time.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Scalar Subqueries in the SELECT List — Incrementally
How pg_trickle maintains correlated subqueries without re-executing them for every row
Scalar subqueries in the SELECT list are a SQL convenience that hides enormous computational cost:
SELECT
o.order_id,
o.customer_id,
o.total,
(SELECT MAX(total) FROM orders o2 WHERE o2.customer_id = o.customer_id) AS customer_max
FROM orders o;
For each row in orders, PostgreSQL runs the inner query. If orders has 1 million rows, the subquery executes 1 million times. Materialized views don't change this — each refresh re-evaluates all subqueries.
pg_trickle maintains scalar subqueries incrementally using a pre/post snapshot diff technique. It doesn't re-run the subquery for every row — only for rows where the subquery result changed.
The Technique
A scalar subquery in the SELECT list is correlated — it references columns from the outer query. The result depends on which outer row it's evaluating.
pg_trickle transforms this into a two-phase process:
Phase 1: Pre-Snapshot
Before processing the delta, compute the scalar subquery result for affected rows using the previous state of the data:
-- Pre-snapshot: MAX(total) per customer, before changes
SELECT customer_id, MAX(total) AS customer_max
FROM orders
WHERE customer_id IN (SELECT customer_id FROM changed_rows)
GROUP BY customer_id;
Phase 2: Post-Snapshot
After applying the source changes, compute the scalar subquery result again:
-- Post-snapshot: MAX(total) per customer, after changes
SELECT customer_id, MAX(total) AS customer_max
FROM orders
WHERE customer_id IN (SELECT customer_id FROM changed_rows)
GROUP BY customer_id;
Phase 3: Diff
Compare pre and post snapshots. Only rows where the subquery result changed need to be updated in the stream table:
-- Rows where customer_max changed, or where the group is new;
-- IS DISTINCT FROM treats a NULL on either side as a change
SELECT post.customer_id, post.customer_max
FROM post_snapshot post
LEFT JOIN pre_snapshot pre USING (customer_id)
WHERE post.customer_max IS DISTINCT FROM pre.customer_max;
Why This Is Efficient
The key insight: the scalar subquery result changes only when the correlated group is affected by the delta.
If 10 new orders come in across 3 customers, only those 3 customers' MAX(total) values could change. The pre/post snapshots are computed only for those 3 customers — not for all 1 million.
| Scenario | Full recompute (materialized view) | Pre/post diff (pg_trickle) |
|---|---|---|
| 10 new orders, 3 customers | 1M subquery evaluations | 3 group evaluations |
| 100 new orders, 50 customers | 1M subquery evaluations | 50 group evaluations |
| 1 deleted order | 1M subquery evaluations | 1 group evaluation |
Creating a Stream Table
SELECT pgtrickle.create_stream_table(
name => 'orders_with_customer_stats',
query => $$
SELECT
o.order_id,
o.customer_id,
o.total,
(SELECT AVG(total) FROM orders o2
WHERE o2.customer_id = o.customer_id) AS customer_avg,
(SELECT COUNT(*) FROM orders o2
WHERE o2.customer_id = o.customer_id) AS customer_order_count
FROM orders o
$$,
schedule => '5s'
);
pg_trickle detects the two scalar subqueries, extracts their correlation keys (customer_id), and applies the pre/post diff technique to each.
Multiple Scalar Subqueries
When a query has multiple scalar subqueries, each is processed independently:
SELECT
p.product_id,
p.name,
(SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.product_id) AS review_count,
(SELECT AVG(rating) FROM reviews r WHERE r.product_id = p.product_id) AS avg_rating,
(SELECT MIN(price) FROM inventory i WHERE i.product_id = p.product_id) AS min_price
FROM products p;
Three scalar subqueries, two source tables (reviews, inventory). When reviews change:
- review_count and avg_rating are re-evaluated for affected products.
- min_price is untouched (inventory didn't change).
When inventory changes:
- min_price is re-evaluated for affected products.
- review_count and avg_rating are untouched.
Each subquery tracks its own source-table dependency.
When a JOIN Is Faster
The pre/post diff technique works, but a JOIN rewrite is often more efficient. The scalar subquery:
SELECT o.*, (SELECT MAX(total) FROM orders o2 WHERE o2.customer_id = o.customer_id)
FROM orders o;
Is equivalent to:
SELECT o.*, m.max_total
FROM orders o
JOIN (SELECT customer_id, MAX(total) AS max_total FROM orders GROUP BY customer_id) m
USING (customer_id);
The JOIN version is more efficient for IVM because:
- The inner aggregate (MAX(total) grouped by customer_id) can be maintained algebraically.
- No pre/post snapshot comparison needed.
pg_trickle doesn't automatically rewrite scalar subqueries to JOINs (the equivalence isn't always trivial), but if performance matters, consider the manual rewrite.
Limitations
Non-correlated scalar subqueries (no reference to the outer query):
SELECT o.*, (SELECT COUNT(*) FROM products) AS total_products FROM orders o;
These are simpler — the subquery result is a single value shared by all rows. pg_trickle caches it and only recomputes when the subquery's source table changes.
Scalar subqueries with side effects: Not supported (and shouldn't be in any query).
Scalar subqueries returning more than one row: PostgreSQL errors at runtime. pg_trickle inherits this behavior.
Deeply nested scalar subqueries (subquery within subquery within SELECT):
SELECT o.*,
(SELECT (SELECT MAX(price) FROM products WHERE category = c.category)
FROM customers c WHERE c.id = o.customer_id) AS category_max_price
FROM orders o;
pg_trickle supports one level of nesting. Deeper nesting works but with increasing overhead — each level adds a pre/post snapshot comparison. Consider rewriting deeply nested scalar subqueries as JOINs.
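One way the nested example above could be rewritten, assuming customers carries the category column the inner subquery correlates on: the per-category maximum becomes an ordinary inner aggregate that can be maintained algebraically, and both lookups become joins.
-- LEFT JOINs preserve orders whose customer or category has no match,
-- mirroring the NULL the scalar subquery would have returned.
SELECT o.*, cm.max_price AS category_max_price
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
LEFT JOIN (SELECT category, MAX(price) AS max_price
           FROM products
           GROUP BY category) cm ON cm.category = c.category;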
Summary
Scalar subqueries in the SELECT list are maintained incrementally using pre/post snapshot comparison on the correlated group. Only groups affected by the delta are re-evaluated.
The cost is O(affected groups), not O(all rows). For typical workloads — small changes touching a few groups — this is orders of magnitude faster than full recomputation.
If performance is critical, consider rewriting scalar subqueries as JOINs for more efficient delta propagation. But if the scalar subquery is clearer and the performance is acceptable, leave it as is — pg_trickle handles it correctly either way.
← Back to Blog Index | Documentation
pg_trickle Monitors Itself
How the extension eats its own cooking
Since v0.20, pg_trickle's internal health metrics are maintained as stream tables. The extension uses itself to monitor itself.
This isn't a gimmick. It's a practical architecture decision: pg_trickle already has an engine for maintaining derived data from source tables. The extension's own operational data lives in PostgreSQL tables. Why build a separate monitoring system when you can point the same engine at its own catalog?
What Self-Monitoring Tracks
When you enable self-monitoring, pg_trickle creates a set of internal stream tables that aggregate operational data:
-- Enable self-monitoring
SELECT pgtrickle.enable_self_monitoring();
This creates stream tables over pg_trickle's own catalog and operational tables:
Refresh performance by stream table
-- Automatically created:
-- pgtrickle._self_refresh_stats
-- Query:
SELECT
st.pgt_name,
st.refresh_mode,
COUNT(*) AS refresh_count,
AVG(h.duration_ms) AS avg_refresh_ms,
MAX(h.duration_ms) AS max_refresh_ms,
SUM(h.rows_affected) AS total_rows_affected,
MAX(h.refreshed_at) AS last_refresh
FROM pgtrickle.pgt_stream_tables st
JOIN pgtrickle.pgt_refresh_history h ON h.pgt_id = st.pgt_id
WHERE h.refreshed_at >= now() - interval '1 hour'
GROUP BY st.pgt_name, st.refresh_mode
This stream table tells you, in real time, how each stream table is performing: average refresh time, max refresh time, total rows processed, and when it last refreshed.
Change buffer depth
-- pgtrickle._self_buffer_depth
-- Tracks how much unprocessed change data is queued per source table
SELECT
source_table,
COUNT(*) AS pending_changes,
MIN(captured_at) AS oldest_change,
now() - MIN(captured_at) AS max_latency
FROM pgtrickle_changes.changes_summary
GROUP BY source_table
If pending_changes is growing faster than the scheduler can drain it, this stream table surfaces the problem before it becomes a production incident.
Error rates
-- pgtrickle._self_error_rates
-- Tracks consecutive errors and error patterns
SELECT
st.pgt_name,
st.consecutive_errors,
st.status,
h.error_message,
h.refreshed_at AS last_error_at
FROM pgtrickle.pgt_stream_tables st
JOIN pgtrickle.pgt_refresh_history h ON h.pgt_id = st.pgt_id
WHERE h.success = false
AND h.refreshed_at >= now() - interval '1 hour'
Why This Matters
The typical monitoring setup for a database extension involves:
- A Prometheus exporter that polls the extension's views
- Grafana dashboards that visualize the metrics
- AlertManager rules that fire when thresholds are breached
This works, but there's a lag: Prometheus scrapes every 15–30 seconds. By the time the alert fires, the problem might have been happening for a minute.
Self-monitoring stream tables are maintained continuously — every 2–5 seconds. And because they're just PostgreSQL tables, you can:
- Query them with arbitrary SQL
- Build other stream tables on top of them (meta-monitoring)
- Subscribe to them via the outbox for real-time alerts
- Expose them to any BI tool that speaks PostgreSQL
Alerts on monitoring data
-- Alert when any stream table's refresh time exceeds 500ms
SELECT pgtrickle.create_stream_table(
'pgtrickle._self_slow_refreshes',
$$SELECT pgt_name, avg_refresh_ms, max_refresh_ms
FROM pgtrickle._self_refresh_stats
WHERE avg_refresh_ms > 500$$,
schedule => '5s', refresh_mode => 'DIFFERENTIAL'
);
-- Enable outbox so the relay can push alerts to Slack/PagerDuty
SELECT pgtrickle.enable_outbox('pgtrickle._self_slow_refreshes');
Now when a stream table starts refreshing slowly, the monitoring stream table picks it up within 5 seconds, and the outbox delivers a notification to your alerting system.
The Recursion Question
"If pg_trickle monitors itself with stream tables, who monitors the monitoring stream tables?"
Fair question. The self-monitoring stream tables are maintained by the same scheduler as user-defined stream tables. If the scheduler is completely dead, the monitoring tables aren't updated either.
pg_trickle handles this with a separate code path:
- The scheduler's heartbeat is written directly to a catalog table (not via a stream table). External monitoring (Prometheus, health check endpoints) reads this heartbeat.
- Self-monitoring stream tables track performance and trends — they're for observability, not liveness. The liveness check is the heartbeat.
- pgtrickle.health_check() returns a summary that combines both: the heartbeat (is the scheduler running?) and the self-monitoring data (how is it performing?).
SELECT * FROM pgtrickle.health_check();
-- Returns: scheduler_running, total_stream_tables, active_count,
-- error_count, avg_refresh_ms, max_staleness, ...
Integration with Prometheus and Grafana
Self-monitoring doesn't replace Prometheus — it complements it. pg_trickle also exposes metrics in Prometheus format:
# Prometheus endpoint (built-in or via pgtrickle-relay)
pgtrickle_refresh_duration_seconds{stream_table="order_totals",mode="DIFFERENTIAL"} 0.012
pgtrickle_refresh_rows_affected{stream_table="order_totals"} 47
pgtrickle_change_buffer_depth{source_table="orders"} 123
pgtrickle_scheduler_lag_seconds 0.3
pgtrickle_stream_table_staleness_seconds{stream_table="order_totals"} 2.1
The difference is granularity: Prometheus gives you time-series data at scrape resolution (typically 15s). Self-monitoring stream tables give you real-time aggregates at refresh resolution (1–5s). Use both.
Disabling Self-Monitoring
If you don't need it — or if you're running in a resource-constrained environment — you can disable it:
SELECT pgtrickle.disable_self_monitoring();
This drops the internal stream tables and frees the scheduler slots they were using. The Prometheus metrics and health check endpoints continue working — they don't depend on self-monitoring.
The Dogfooding Effect
Building self-monitoring on top of the same engine forces the pg_trickle team to care about edge cases that only surface under real load. If the monitoring stream tables are slow, that's a bug in the engine. If they produce incorrect results, that's a correctness issue. If they consume too many resources, that's a scheduler efficiency problem.
Every improvement to the self-monitoring stream tables is an improvement to the engine itself.
← Back to Blog Index | Documentation
Set Operations Done Right: UNION, INTERSECT, EXCEPT
Incremental maintenance of set operations with multiplicity tracking
Set operations in SQL — UNION, INTERSECT, EXCEPT — combine results from multiple queries. They're straightforward to compute from scratch but surprisingly subtle to maintain incrementally. The subtlety is in the multiplicities: how many copies of each row exist on each side, and what happens when those counts change.
pg_trickle maintains all three set operations (and their ALL variants) incrementally using dual-count multiplicity tracking. Each result row tracks how many copies exist on the left side and the right side, and the set operation's semantics determine whether the row appears in the output.
UNION ALL: The Simple Case
SELECT name, email FROM customers
UNION ALL
SELECT name, email FROM prospects;
UNION ALL is the simplest: no deduplication. Every row from both sides appears in the result. The delta rule is trivial:
- Insert on left → insert in result.
- Insert on right → insert in result.
- Delete on left → delete from result.
- Delete on right → delete from result.
pg_trickle handles this with standard delta propagation. No multiplicity tracking needed.
UNION (Deduplicating): Where It Gets Interesting
SELECT name, email FROM customers
UNION
SELECT name, email FROM prospects;
UNION (without ALL) deduplicates: if the same row exists in both sides, it appears once in the result.
The delta rule requires knowing how many copies of each row exist across both sides:
left_count = number of copies on the left side
right_count = number of copies on the right side
total = left_count + right_count
Row appears in result if total > 0.
Insert on left side:
left_count += 1
if left_count + right_count was 0 (row was absent) → INSERT into result
else → no output change (row already present)
Delete on left side:
left_count -= 1
if left_count + right_count = 0 → DELETE from result
else → no output change (other copies remain)
This is the same reference-counting approach used for DISTINCT, extended to track counts per side.
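As a mental model (not pg_trickle's actual internal storage; the table and column names below are purely illustrative), the bookkeeping for a deduplicating UNION can be pictured as one count pair per distinct row:
-- Illustrative sketch only: per-row side counts for a deduplicating UNION.
-- pg_trickle keeps equivalent state internally; this is not its real schema.
CREATE TABLE union_counts (
    name        text,
    email       text,
    left_count  int NOT NULL DEFAULT 0,   -- copies contributed by customers
    right_count int NOT NULL DEFAULT 0,   -- copies contributed by prospects
    PRIMARY KEY (name, email)
);
-- A row belongs in the UNION result while left_count + right_count > 0
SELECT name, email FROM union_counts WHERE left_count + right_count > 0;
An insert on either side bumps the corresponding count; the row is emitted only on the 0 → 1 transition of the sum, mirroring the delta rules above.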
INTERSECT: Present on Both Sides
SELECT product_id FROM warehouse_a
INTERSECT
SELECT product_id FROM warehouse_b;
Products available in both warehouses. The result includes a row only if it exists on both sides.
Row appears in result if left_count > 0 AND right_count > 0.
Insert on left side (new product in warehouse A):
left_count += 1
if left_count = 1 AND right_count > 0 → INSERT into result (now present on both sides)
Delete on left side:
left_count -= 1
if left_count = 0 AND right_count > 0 → DELETE from result (no longer on both sides)
Insert on right side: Mirror of left-side logic.
EXCEPT: Present on Left, Not on Right
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM opted_out_customers;
Customers who haven't opted out. The result includes a row if it's on the left side and not on the right side.
Row appears in result if left_count > 0 AND right_count = 0.
Insert on right side (customer opts out):
right_count += 1
if left_count > 0 AND right_count = 1 → DELETE from result (now excluded)
Delete on right side (customer opts back in):
right_count -= 1
if left_count > 0 AND right_count = 0 → INSERT into result (no longer excluded)
This is the anti-join behavior: adding to the right side removes from the result. It's the set-operation analog of NOT EXISTS.
The ALL Variants
INTERSECT ALL and EXCEPT ALL preserve multiplicities:
INTERSECT ALL: The result contains min(left_count, right_count) copies of each row.
EXCEPT ALL: The result contains max(left_count - right_count, 0) copies.
The delta rules are more complex because changing a count on one side can change the output multiplicity by more than 1. pg_trickle handles this by computing the before and after output counts and emitting the difference:
output_before = min(left_count_before, right_count_before) -- for INTERSECT ALL
output_after = min(left_count_after, right_count_after)
delta = output_after - output_before
if delta > 0: emit INSERT × delta
if delta < 0: emit DELETE × |delta|
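A quick worked example of that rule, with made-up counts: suppose a row has left_count = 3 and right_count = 4 under INTERSECT ALL, and a batch of left-side inserts raises left_count to 5.
output_before = min(3, 4) = 3
output_after = min(5, 4) = 4
delta = +1 → emit one INSERT for that row
EXCEPT ALL follows the same before/after pattern with max(left_count - right_count, 0).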
Creating Stream Tables with Set Operations
SELECT pgtrickle.create_stream_table(
name => 'available_everywhere',
query => $$
SELECT product_id, product_name FROM warehouse_east
INTERSECT
SELECT product_id, product_name FROM warehouse_west
$$,
schedule => '10s'
);
When a product is added to warehouse_east, pg_trickle:
- Increments the left-side count for that product.
- Checks the right-side count.
- If the product is now in both warehouses → inserts into the result.
When a product is removed from warehouse_west:
- Decrements the right-side count.
- If the count drops to 0 → removes the product from the result.
Building Merge Tables from Heterogeneous Sources
A practical use of UNION ALL: combining data from multiple source systems into a unified view.
SELECT pgtrickle.create_stream_table(
name => 'all_contacts',
query => $$
SELECT 'crm' AS source, id, name, email FROM crm_contacts
UNION ALL
SELECT 'marketing' AS source, id, name, email FROM marketing_leads
UNION ALL
SELECT 'support' AS source, ticket_contact_id AS id, name, email FROM support_tickets
$$,
schedule => '5s'
);
Three different source systems, merged into one stream table. Changes to any source are reflected in the merged result within 5 seconds. Each branch maintains its own delta independently.
If you need deduplication across sources (same email from CRM and marketing → one row):
SELECT pgtrickle.create_stream_table(
name => 'unique_contacts',
query => $$
SELECT DISTINCT email, name FROM (
SELECT name, email FROM crm_contacts
UNION ALL
SELECT name, email FROM marketing_leads
UNION ALL
SELECT name, email FROM support_tickets
) all_sources
$$,
schedule => '10s'
);
The UNION ALL feeds into DISTINCT, which uses reference counting. A contact appearing in all three systems has __pgt_dup_count = 3. Remove them from CRM → count drops to 2. Remove from all three → count drops to 0 → removed from result.
Performance
Set operation delta costs:
| Operation | Left-side change cost | Right-side change cost |
|---|---|---|
| UNION ALL | O(delta) — direct pass-through | O(delta) |
| UNION | O(delta) — count check per row | O(delta) |
| INTERSECT | O(delta) — check other side's count | O(delta) |
| EXCEPT | O(delta) — check other side's count | O(delta) |
All operations are O(delta) per refresh cycle. The constant factor is slightly higher for deduplicating operations (UNION, INTERSECT, EXCEPT) because each changed row requires a count lookup on the other side.
Summary
pg_trickle maintains set operations incrementally using dual-count multiplicity tracking. Each result row knows how many copies exist on the left and right sides. The set operation's semantics (UNION: either side, INTERSECT: both sides, EXCEPT: left but not right) determine when the row appears in the output.
UNION ALL is a direct pass-through. UNION uses reference counting. INTERSECT requires presence on both sides. EXCEPT removes from the result when the right side gains a match.
For merging heterogeneous data sources, combining UNION ALL with DISTINCT gives you a continuously maintained, deduplicated merge table. All of it incremental. All of it O(delta).
← Back to Blog Index | Documentation
Testing Stream Tables: Shadow Mode and Correctness Fuzzing
How pg_trickle validates that differential refresh matches full refresh
Incremental view maintenance has a correctness invariant: the result of applying deltas incrementally must be identical to recomputing the query from scratch.
This sounds obvious, but it's surprisingly easy to violate. Edge cases in JOIN delta rules, NULL handling in aggregates, concurrent modifications during refresh, boundary conditions in GROUP BY with zero-count groups — each of these can produce a result that's silently wrong.
pg_trickle tests this invariant aggressively, with two complementary techniques: shadow mode (production validation) and SQLancer-based fuzzing (pre-release testing).
Shadow Mode
Shadow mode runs DIFFERENTIAL and FULL refresh in parallel on the same stream table and compares the results. If they diverge, it raises an alert.
Enabling Shadow Mode
SELECT pgtrickle.alter_stream_table(
'revenue_by_region',
shadow_mode => true
);
With shadow mode enabled, every refresh cycle:
- Runs the normal DIFFERENTIAL refresh (applies the delta).
- Runs a FULL refresh (recomputes from scratch) into a shadow table.
- Compares the two results row-by-row.
- If they match: normal operation continues.
- If they diverge: logs a warning with the differing rows and optionally raises an alert.
The DIFFERENTIAL result is the one that's committed to the stream table — shadow mode doesn't affect the data your application sees. The FULL refresh runs in a separate transaction and is discarded after comparison.
What Divergence Looks Like
WARNING: shadow mode divergence detected for "revenue_by_region"
Rows only in DIFFERENTIAL result:
(region='europe', revenue=150200.50, order_count=1203)
Rows only in FULL result:
(region='europe', revenue=150200.00, order_count=1203)
Divergence: 1 row(s), max delta: revenue differs by 0.50
This tells you that the differential engine computed a revenue of $150,200.50 for Europe, but the full recomputation says it should be $150,200.00. There's a 50-cent discrepancy — probably a rounding issue in the delta rule for a specific aggregation path.
When to Use Shadow Mode
- After deploying a new pg_trickle version. Run shadow mode for a few hours or days on your most complex stream tables to validate that the new version's delta engine produces correct results.
- On complex queries. Multi-table JOINs with nested aggregations and CASE expressions are where delta bugs are most likely to hide. Shadow mode catches them before users do.
- As a canary. Enable shadow mode on one representative stream table permanently. If the differential engine ever regresses, you'll know immediately.
Performance Impact
Shadow mode adds a FULL refresh and a row-by-row comparison to every cycle, so each refresh costs roughly the DIFFERENTIAL cost plus the FULL cost. For small stream tables that may only double the refresh time; for large ones, the full recomputation dominates. For production use, pick a few representative tables rather than enabling it on everything.
SQLancer Fuzzing
SQLancer is a database testing tool that generates random SQL queries and checks them for correctness. pg_trickle's test suite uses SQLancer-based fuzzing to find delta engine bugs before they reach production.
How It Works
The fuzzer:
- Generates a random schema (tables with various column types).
- Generates a random query that's valid for IVM (JOINs, GROUP BYs, aggregates, filters).
- Creates a stream table with DIFFERENTIAL mode.
- Generates random DML (INSERTs, UPDATEs, DELETEs) against the source tables.
- Refreshes the stream table.
- Runs the defining query from scratch (FULL refresh) and compares.
- Repeats with more random DML.
If the DIFFERENTIAL result ever diverges from the FULL result, the fuzzer reports the schema, query, DML sequence, and the divergent rows. This is a minimal reproduction case that the team can investigate and fix.
What It's Found
Over the development of pg_trickle, SQLancer fuzzing has found:
- NULL group handling: GROUP BY on a nullable column where the group key transitions from NULL to non-NULL in an UPDATE. The delta rule was computing the old group's aggregate incorrectly.
- Empty group cleanup: When all rows in a GROUP BY group are deleted, the aggregate should be removed from the stream table. The delta engine was leaving zero-count groups in some JOIN configurations.
- Multi-column update ordering: When an UPDATE changes both the JOIN key and an aggregated value in the same statement, the delta engine needs to process the key change before the value change. A specific three-table JOIN configuration triggered the wrong ordering.
- CASE expression with NULL: SUM(CASE WHEN x IS NULL THEN 0 ELSE x END) had a delta rule that didn't handle the transition from NULL to non-NULL correctly.
Each of these bugs was caught by the fuzzer in automated testing, before any release. The fixes are in pg_trickle's test suite as regression tests.
The Multiset Invariant
The correctness invariant that both shadow mode and fuzzing check is the multiset invariant:
DIFFERENTIAL_RESULT = FULL_RESULT
Where both sides are compared as multisets (bags, not sets). Row ordering doesn't matter. But duplicates do — if the differential result has two copies of a row and the full result has one, that's a divergence.
The comparison is done column-by-column with type-aware equality:
- Numeric columns are compared with configurable tolerance (default: exact).
- Timestamps are compared with microsecond precision.
- NULL values are treated as equal to each other for comparison purposes (NULL matches NULL), rather than using SQL's three-valued logic.
- Array columns are compared as sorted sets.
Running the Fuzz Tests
pg_trickle's fuzz targets are in the fuzz/ directory:
# Run the differential correctness fuzzer
cargo +nightly fuzz run fuzz_differential -- -max_total_time=300
This runs for 5 minutes, generating random schemas, queries, and DML sequences. Any divergence is reported as a crash with a reproducer in fuzz/artifacts/.
The CI pipeline runs fuzzing on a daily schedule. If a new commit introduces a delta engine regression, the next day's fuzz run catches it.
Writing Custom Correctness Tests
If you have a specific query pattern you're concerned about, you can write a targeted correctness test:
-- Create the stream table
SELECT pgtrickle.create_stream_table(
'test_target',
$$SELECT region, SUM(amount) AS total
FROM orders JOIN customers ON customers.id = orders.customer_id
GROUP BY region$$,
schedule => '1s', refresh_mode => 'DIFFERENTIAL'
);
-- Make some changes
INSERT INTO orders (customer_id, amount) VALUES (1, 100);
UPDATE customers SET region = 'asia' WHERE id = 1;
DELETE FROM orders WHERE amount < 10;
-- Wait for refresh
SELECT pg_sleep(2);
-- Compare DIFFERENTIAL result with FULL recomputation
SELECT * FROM test_target
EXCEPT
SELECT region, SUM(amount) AS total
FROM orders JOIN customers ON customers.id = orders.customer_id
GROUP BY region;
-- Should return zero rows
If this returns any rows, the differential engine has a bug for your specific query pattern.
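One caveat about the EXCEPT check above: EXCEPT deduplicates, so it's sufficient for this GROUP BY query (which cannot emit duplicate rows), but for queries that can produce duplicates a copy-count divergence would slip through. A multiset-aware variant, sketched here by comparing grouped counts in both directions, looks like this:
-- Multiset-aware comparison (sketch): group both sides and compare copy counts
WITH differential AS (
    SELECT region, total, COUNT(*) AS copies
    FROM test_target
    GROUP BY region, total
),
recomputed AS (
    SELECT region, total, COUNT(*) AS copies
    FROM (
        SELECT region, SUM(amount) AS total
        FROM orders JOIN customers ON customers.id = orders.customer_id
        GROUP BY region
    ) full_result
    GROUP BY region, total
)
SELECT *
FROM differential d
FULL OUTER JOIN recomputed r USING (region, total)
WHERE d.copies IS DISTINCT FROM r.copies;
-- Should return zero rows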
The Cost of Correctness
Shadow mode costs CPU (double refresh). Fuzzing costs CI time. Regression tests cost maintenance.
The alternative is deploying a delta engine that might silently produce wrong results. For a system whose entire value proposition is "your data is always correct and up to date," this isn't an option.
← Back to Blog Index | Documentation
Slowly Changing Dimensions in Real Time
SCD Type 2 without nightly ETL, without Airflow, without leaving PostgreSQL
Slowly changing dimensions are a data warehousing concept with a misleading name. There's nothing slow about them in practice — customer tiers change, product prices update, employee departments shift. The "slowly" just means the change frequency is lower than transactional data.
SCD Type 2 is the version where you keep history: when a customer moves from the "gold" tier to "platinum," you don't overwrite the old row. You close it (set valid_to = now()) and insert a new row (with valid_from = now(), valid_to = NULL). Every historical state is preserved.
The traditional implementation involves a nightly ETL job that compares today's snapshot with yesterday's, detects changes, closes old rows, and opens new ones. It runs in Airflow. It takes 45 minutes. If it fails, your dimension table is stale until someone notices.
pg_trickle can maintain SCD Type 2 tables continuously — no ETL, no scheduler outside PostgreSQL, no batch window.
The Setup
Start with a standard customer table that your application writes to:
CREATE TABLE customers (
id bigint PRIMARY KEY,
name text NOT NULL,
email text NOT NULL,
tier text NOT NULL DEFAULT 'standard',
region text NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);
When the application updates a customer's tier, it just runs UPDATE customers SET tier = 'platinum' WHERE id = 42. No SCD logic in the application.
The SCD Type 2 Stream Table
The SCD dimension is a stream table that tracks the history of changes:
-- Event log: capture every change to customers as an event
CREATE TABLE customer_changes (
id bigserial PRIMARY KEY,
customer_id bigint NOT NULL,
name text NOT NULL,
email text NOT NULL,
tier text NOT NULL,
region text NOT NULL,
changed_at timestamptz NOT NULL DEFAULT now()
);
-- Trigger to capture changes into the event log
CREATE OR REPLACE FUNCTION capture_customer_change() RETURNS trigger AS $$
BEGIN
INSERT INTO customer_changes (customer_id, name, email, tier, region, changed_at)
VALUES (NEW.id, NEW.name, NEW.email, NEW.tier, NEW.region, now());
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_customer_changes
AFTER INSERT OR UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_change();
Now build the SCD Type 2 dimension as a stream table:
SELECT pgtrickle.create_stream_table(
'dim_customer_scd2',
$$SELECT
c1.customer_id,
c1.name,
c1.tier,
c1.region,
c1.changed_at AS valid_from,
c2.changed_at AS valid_to
FROM customer_changes c1
LEFT JOIN customer_changes c2
ON c2.customer_id = c1.customer_id
AND c2.changed_at = (
SELECT MIN(c3.changed_at)
FROM customer_changes c3
WHERE c3.customer_id = c1.customer_id
AND c3.changed_at > c1.changed_at
)$$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL'
);
This produces a standard SCD Type 2 table:
| customer_id | name | tier | region | valid_from | valid_to |
|---|---|---|---|---|---|
| 42 | Alice | standard | europe | 2026-01-15 | 2026-03-22 |
| 42 | Alice | gold | europe | 2026-03-22 | 2026-04-10 |
| 42 | Alice | platinum | europe | 2026-04-10 | NULL |
The row with valid_to = NULL is the current state. Historical rows have both timestamps set.
What Happens When a Dimension Changes
When the application runs UPDATE customers SET tier = 'platinum' WHERE id = 42:
- The capture_customer_change trigger fires, inserting a new row into customer_changes.
- pg_trickle's CDC trigger captures the insert into the change buffer.
- Within 2 seconds, the scheduler refreshes dim_customer_scd2.
- The differential engine computes the delta:
  - The previous "current" row (tier = 'gold', valid_to = NULL) gets its valid_to set.
  - A new "current" row (tier = 'platinum', valid_to = NULL) is inserted.
The SCD table is up to date within 2 seconds of the source change. No nightly batch. No Airflow DAG. No manual intervention.
Querying the SCD
Current state
SELECT * FROM dim_customer_scd2
WHERE customer_id = 42 AND valid_to IS NULL;
State at a point in time
SELECT * FROM dim_customer_scd2
WHERE customer_id = 42
AND valid_from <= '2026-03-25'
AND (valid_to IS NULL OR valid_to > '2026-03-25');
This returns the "gold" row — because on March 25, Alice was in the gold tier.
JOIN with fact tables
-- Revenue by customer tier at the time of each order
SELECT
d.tier AS tier_at_order_time,
SUM(o.amount) AS revenue
FROM orders o
JOIN dim_customer_scd2 d
ON d.customer_id = o.customer_id
AND o.created_at >= d.valid_from
AND (d.valid_to IS NULL OR o.created_at < d.valid_to)
GROUP BY d.tier;
This gives you revenue grouped by the customer's tier at the time of the order — not their current tier. This is the whole point of SCD Type 2.
SCD Type 1: Overwrite
If you don't need history — just the current state, always overwritten — that's even simpler. A stream table with the right query is already SCD Type 1:
SELECT pgtrickle.create_stream_table(
'dim_customer_current',
$$SELECT id AS customer_id, name, email, tier, region
FROM customers$$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL'
);
Every time a customer is updated, the stream table reflects the new values within 2 seconds. No history, no valid_from/valid_to — just the latest state.
Why Not Just Use Triggers?
You could implement SCD Type 2 with a trigger that directly maintains the dimension table. Many teams do. The problems:
- Write-path coupling. The trigger runs in the application's transaction. If the SCD logic is slow (e.g., finding the previous row, closing it, inserting a new one), it adds latency to every customer update.
- Correctness under concurrency. If two updates to the same customer happen concurrently, the trigger needs to handle the race condition. Getting the valid_to assignment right under concurrent writes is non-trivial.
- No monitoring. If the trigger fails or produces incorrect data, you won't know until someone queries the dimension and gets wrong results.
- Maintenance burden. Every schema change to the source table requires updating the trigger. Adding a column? Update the trigger. Renaming a column? Update the trigger.
With pg_trickle, the SCD logic is in a SQL query. Schema changes are handled by alter_stream_table. Monitoring is built in. Concurrency is handled by the engine's transaction isolation. The write path has no additional overhead beyond the CDC trigger (which is minimal and standard across all stream tables).
Combining with the Medallion Pattern
SCD Type 2 dimensions fit naturally into a medallion architecture:
- Bronze: customers table (mutable, application-facing)
- Silver: customer_changes event log (append-only, captured by trigger)
- Gold: dim_customer_scd2 stream table (SCD Type 2, maintained by pg_trickle)
The Gold layer is queryable by BI tools, analytics pipelines, and the application itself. It's always fresh. It's always correct. And it's just a PostgreSQL table.
← Back to Blog Index | Documentation
Snapshots: Time Travel for Stream Tables
Bookmark, compare, rollback, and bootstrap with point-in-time captures
You're about to deploy a migration that changes a stream table's defining query. If something goes wrong, you'd like to compare the new result with the old one. Or roll back entirely.
pg_trickle's snapshot system lets you do this. snapshot_stream_table() captures the current contents of a stream table into an ordinary PostgreSQL table. restore_from_snapshot() puts it back. The snapshot is a regular table — you can query it, join it, diff it, export it.
Taking a Snapshot
SELECT pgtrickle.snapshot_stream_table('revenue_by_region');
This creates a table named pgtrickle.snapshot_revenue_by_region_<timestamp> containing an exact copy of the current stream table contents.
You can also name the snapshot:
SELECT pgtrickle.snapshot_stream_table('revenue_by_region', 'before_migration');
This creates pgtrickle.snapshot_revenue_by_region_before_migration.
The snapshot is a plain table. It's not a stream table — it has no CDC triggers, no refresh schedule, no change buffers. It's a frozen point-in-time copy.
Listing Snapshots
SELECT * FROM pgtrickle.list_snapshots();
stream_table | snapshot_name | created_at | row_count
--------------------+-------------------+---------------------+-----------
revenue_by_region | before_migration | 2026-04-27 14:30:00 | 152,847
revenue_by_region | 20260427_143200 | 2026-04-27 14:32:00 | 152,903
customer_metrics | pre_reorg | 2026-04-26 09:15:00 | 48,291
Or filter by stream table:
SELECT * FROM pgtrickle.list_snapshots('revenue_by_region');
Restoring from a Snapshot
SELECT pgtrickle.restore_from_snapshot('revenue_by_region', 'before_migration');
This:
- Truncates the current stream table contents.
- Copies the snapshot data into the stream table.
- Resets the frontier — so the next refresh reads all changes since the snapshot was taken.
After restore, the stream table resumes normal operation. The next scheduled refresh processes the delta between the snapshot state and the current source data. If the source hasn't changed much since the snapshot, this delta is small and fast.
Warning: Restoring a snapshot doesn't revert the stream table's defining query or configuration. If you changed the query before restoring, the snapshot data will be refreshed using the new query. If the schemas don't match, the restore fails.
Use Case 1: Pre-Migration Safety Net
The most common use case. Before changing a stream table's query, take a snapshot:
-- Before migration
SELECT pgtrickle.snapshot_stream_table('order_summary', 'pre_migration_v2');
-- Apply the migration
SELECT pgtrickle.alter_stream_table(
name => 'order_summary',
query => $$ ... new query ... $$
);
-- Verify results look correct
SELECT COUNT(*) FROM pgtrickle.order_summary; -- new data
SELECT COUNT(*) FROM pgtrickle.snapshot_order_summary_pre_migration_v2; -- old data
-- Compare
SELECT * FROM pgtrickle.order_summary
EXCEPT
SELECT * FROM pgtrickle.snapshot_order_summary_pre_migration_v2;
If the new query is wrong, restore:
-- Revert the query
SELECT pgtrickle.alter_stream_table(
name => 'order_summary',
query => $$ ... original query ... $$
);
-- Restore the data
SELECT pgtrickle.restore_from_snapshot('order_summary', 'pre_migration_v2');
Use Case 2: Replica Bootstrap
When you add a new PostgreSQL replica, stream tables are empty until the first FULL refresh completes. For large stream tables, that first refresh can take minutes.
Snapshots offer a faster alternative:
- On the primary, take a snapshot: snapshot_stream_table('big_summary').
- pg_dump the snapshot table and restore it on the replica.
- On the replica, restore: restore_from_snapshot('big_summary', 'bootstrap').
The stream table is immediately queryable with the snapshot data. The next refresh processes only the changes since the snapshot, which is much faster than a full recomputation.
For streaming replication (not logical), this isn't necessary — the replica already has the stream table data via WAL replay. But for logical replication or fresh standby setups, it's a significant time saver.
Use Case 3: Forensic Comparison
Something changed in the business metrics. Revenue dropped 15% on Tuesday. Was it a data issue or a real business event?
If you have daily snapshots, you can compare:
-- Revenue by region: today vs. Monday
SELECT
t.region,
t.revenue AS today,
m.revenue AS monday,
t.revenue - m.revenue AS delta
FROM pgtrickle.revenue_by_region t
FULL OUTER JOIN pgtrickle.snapshot_revenue_by_region_monday m
USING (region)
ORDER BY delta;
This is trivial with snapshots and impossible without them (or without a separate time-series history system).
Use Case 4: Test Fixtures
Snapshot production data for deterministic test environments:
-- Production
SELECT pgtrickle.snapshot_stream_table('product_search', 'test_fixture_2026q2');
-- Export
pg_dump -t pgtrickle.snapshot_product_search_test_fixture_2026q2 prod_db > fixture.sql
-- Test environment
psql test_db < fixture.sql
SELECT pgtrickle.restore_from_snapshot('product_search', 'test_fixture_2026q2');
Your test environment now has a known-good starting state. Tests run against it, and results are reproducible.
Snapshot Storage
Snapshots are ordinary heap tables. They consume the same storage as the original stream table data — roughly the same number of pages. pg_trickle doesn't compress or deduplicate snapshots.
For a stream table with 1 million rows at 200 bytes per row:
- Stream table: ~200MB
- Each snapshot: ~200MB
Plan storage accordingly. If you take daily snapshots and retain 30 days, that's 6GB for a single stream table.
Retention and Cleanup
pg_trickle doesn't automatically delete old snapshots. You manage retention explicitly:
-- Drop a specific snapshot
SELECT pgtrickle.drop_snapshot('revenue_by_region', 'before_migration');
-- Drop all snapshots older than 7 days (manual query)
SELECT pgtrickle.drop_snapshot(stream_table, snapshot_name)
FROM pgtrickle.list_snapshots()
WHERE created_at < NOW() - INTERVAL '7 days';
For automated retention, wrap this in a cron job or a pg_cron task:
-- pg_cron: daily cleanup of snapshots older than 30 days
SELECT cron.schedule('snapshot-cleanup', '0 3 * * *', $$
SELECT pgtrickle.drop_snapshot(stream_table, snapshot_name)
FROM pgtrickle.list_snapshots()
WHERE created_at < NOW() - INTERVAL '30 days'
$$);
Snapshots vs. pg_dump
pg_dump backs up the entire database (or specific tables). Snapshots capture a single stream table's state within the running database.
| | Snapshots | pg_dump |
|---|---|---|
| Scope | One stream table | Whole database or selected tables |
| Speed | Instant (copy within same DB) | Depends on DB size |
| Storage | Same database | External file |
| Restore | restore_from_snapshot() | pg_restore |
| Cross-environment | No (same database) | Yes |
| Frontier reset | Yes (automatic) | Manual (repair_stream_table) |
Use snapshots for operational safety (pre-migration, comparison, quick rollback). Use pg_dump for disaster recovery and cross-environment transfers.
Summary
Snapshots are point-in-time copies of stream table contents, stored as ordinary tables.
- snapshot_stream_table() to capture.
- restore_from_snapshot() to roll back.
- list_snapshots() to inventory.
- drop_snapshot() to clean up.
They're useful before migrations, for bootstrapping replicas, for forensic comparison, and for test fixtures. They're cheap to take (a table copy), and they compose with standard PostgreSQL tools (pg_dump, queries, joins).
If you're running stream tables in production and you're not taking snapshots before schema changes, you're flying without a parachute. It takes one function call to put one on.
← Back to Blog Index | Documentation
Soft Deletes and Tombstone Management in Differential IVM
How deleted_at patterns interact with delta propagation, and the right way to model soft deletion for stream tables
Soft deletes are everywhere. Instead of DELETE FROM users WHERE id = 42, you write UPDATE users SET deleted_at = now() WHERE id = 42. The row stays in the table, invisible to the application but available for audit, recovery, and compliance. It's a sensible pattern. It's also one that creates subtle correctness issues when combined with incremental view maintenance.
The problem is that a soft-deleted row is still physically present in the table. If your stream table's query doesn't filter on deleted_at, it will include "deleted" rows in its aggregates. If it does filter on deleted_at, then soft-deleting a row is semantically an UPDATE to the source table but functionally a DELETE from the stream table's perspective. Getting the delta propagation right for this case requires understanding how pg_trickle processes updates and what happens when a row transitions from "visible" to "invisible" in a filtered query.
The Naive Approach and Its Problems
Consider a stream table that counts active users per plan:
-- Source table with soft deletes
CREATE TABLE users (
id serial PRIMARY KEY,
plan text NOT NULL,
email text NOT NULL,
deleted_at timestamptz -- NULL means active
);
-- Stream table: count users per plan
SELECT pgtrickle.create_stream_table(
'plan_counts',
$$
SELECT plan, COUNT(*) AS user_count
FROM users
WHERE deleted_at IS NULL
GROUP BY plan
$$
);
When you soft-delete a user (UPDATE users SET deleted_at = now() WHERE id = 42), the following happens:
- The CDC trigger fires, recording the change: old row (deleted_at = NULL) → new row (deleted_at = now()).
- pg_trickle processes the delta: the old row satisfied the WHERE deleted_at IS NULL filter, but the new row does not.
- From the stream table's perspective, this is a row removal — the user is no longer counted in plan_counts.
- The count for that user's plan decreases by 1.
This works correctly. pg_trickle's differential engine handles it because it sees the update as a simultaneous removal of the old row (weight -1) and insertion of the new row (weight +1). Since the new row doesn't pass the filter, only the removal propagates. The aggregate decreases.
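A quick walkthrough against the schema above makes this concrete (illustrative values; the exact lag depends on the stream table's schedule):
-- Illustrative walkthrough using the users / plan_counts setup above
INSERT INTO users (plan, email) VALUES ('pro', 'alice@example.com');
-- after the next refresh: plan_counts contains ('pro', 1)
UPDATE users SET deleted_at = now() WHERE email = 'alice@example.com';
-- after the next refresh: the 'pro' count drops by 1; if it reaches zero,
-- the group disappears from plan_counts, exactly as recomputing the
-- defining query from scratch would show
SELECT * FROM plan_counts WHERE plan = 'pro';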
Where It Gets Tricky: Un-Deleting
The soft delete pattern implies the ability to un-delete: UPDATE users SET deleted_at = NULL WHERE id = 42. This reverses the transition — a row that was invisible becomes visible again.
pg_trickle handles this correctly too: the old row (deleted_at = now()) doesn't pass the filter, but the new row (deleted_at = NULL) does. The stream table sees a new row appearing, and the count increases.
The issue isn't correctness — it's performance. Every update to any column in the users table triggers the CDC mechanism, which records the before and after image. If you frequently update non-relevant columns (last_login_at, session_count, etc.), the trigger fires for every one of those updates, even though they don't affect the stream table's result.
pg_trickle's differential engine will correctly determine that these updates don't change the stream table output (both old and new rows pass the filter with the same projected values, so the net delta is zero). But the trigger still fires, the change buffer still receives the event, and the refresh still processes it — only to discard it.
Optimizing With Column-Level Filtering
For tables with frequent updates to non-relevant columns, pg_trickle's column-level change detection ensures that only updates to columns referenced in the stream table's query generate meaningful deltas. The users table might see thousands of updates per second to last_login_at, but if the stream table only references plan and deleted_at, updates to other columns produce zero-deltas that are discarded early in the refresh pipeline.
The practical advice: when designing tables with soft deletes that back stream tables, keep the deleted_at column in the same table as the columns you're aggregating. Don't put it in a separate "metadata" table that requires a join — that would force the stream table to maintain a join, which is more expensive than filtering a column.
Ghost Rows in Aggregates
The most dangerous pattern with soft deletes is forgetting to filter:
-- WRONG: includes soft-deleted users in the count
SELECT pgtrickle.create_stream_table(
'plan_counts_buggy',
'SELECT plan, COUNT(*) AS user_count FROM users GROUP BY plan'
);
This stream table counts all users, including soft-deleted ones. It's technically correct (it reflects the table's physical state), but it's almost certainly not what the application intends. The dashboard shows inflated numbers. The billing system charges for inactive users. The capacity planning is wrong.
Worse, because the stream table is incrementally maintained, the bug is invisible during normal operation. New users appear, soft-deleted users stay counted, and the numbers only grow. You won't notice until someone asks why the "active users" metric never decreases despite daily churn.
The fix is simple but must be deliberate:
-- CORRECT: always filter soft-deleted rows in stream table definitions
SELECT pgtrickle.create_stream_table(
'plan_counts',
$$
SELECT plan, COUNT(*) AS user_count
FROM users
WHERE deleted_at IS NULL
GROUP BY plan
$$
);
Make it a code review rule: every stream table over a table with a deleted_at column must have WHERE deleted_at IS NULL unless there's an explicit reason to include soft-deleted rows.
Tombstone Accumulation and Performance
Soft deletes create a long-term storage problem. Rows accumulate in the table forever unless you periodically purge them. For the source table, this means bloat and slower index scans. For stream tables, it means the change buffer processes deletions when you eventually hard-delete old tombstones.
When you run a cleanup job:
-- Monthly tombstone purge: hard-delete rows soft-deleted more than 90 days ago
DELETE FROM users WHERE deleted_at < now() - interval '90 days';
This generates DELETE events in the CDC buffer. But because the stream table's query filters WHERE deleted_at IS NULL, and these rows already had deleted_at set (they were already invisible to the stream table), the deletes produce zero-deltas. The stream table's aggregates don't change.
pg_trickle handles this efficiently: the differential engine evaluates the deleted rows against the stream table's filter, determines they were already excluded, and skips them. The refresh processes the events but produces no output changes.
However, if you have many stream tables that join against the soft-deleted table, each one evaluates the purged rows independently. For a bulk purge of 1 million tombstones with 10 stream tables referencing that table, that's 10 million delta evaluations (each producing zero output). Not catastrophic, but worth scheduling during low-traffic windows.
The Temporal Soft Delete Pattern
A more sophisticated approach uses a validity range instead of a single timestamp:
CREATE TABLE users (
id serial PRIMARY KEY,
plan text NOT NULL,
email text NOT NULL,
valid_from timestamptz NOT NULL DEFAULT now(),
valid_to timestamptz -- NULL means currently active
);
This supports time-travel queries ("what was the user's plan on March 15th?") and makes the soft-delete semantic explicit. A user is "active" when valid_to IS NULL or valid_to > now().
For stream tables, this pattern works identically to deleted_at IS NULL filtering:
SELECT pgtrickle.create_stream_table(
'active_users_by_plan',
$$
SELECT plan, COUNT(*) AS user_count
FROM users
WHERE valid_to IS NULL
GROUP BY plan
$$
);
The advantage is clarity: the semantics are explicit in the schema, and the stream table filter is self-documenting.
Multi-Table Soft Deletes and Cascading
Real applications have related tables that all use soft deletes. An organization is soft-deleted, and all its users should become invisible:
-- Organizations and users both have soft deletes
CREATE TABLE organizations (
id serial PRIMARY KEY,
name text,
deleted_at timestamptz
);
CREATE TABLE users (
id serial PRIMARY KEY,
org_id integer REFERENCES organizations(id),
email text,
deleted_at timestamptz
);
A stream table that should only show users from active organizations:
SELECT pgtrickle.create_stream_table(
'active_user_directory',
$$
SELECT u.id, u.email, o.name AS org_name
FROM users u
JOIN organizations o ON o.id = u.org_id
WHERE u.deleted_at IS NULL
AND o.deleted_at IS NULL
$$
);
When an organization is soft-deleted, all its users disappear from the stream table — even though the users themselves weren't modified. pg_trickle detects the organization's deleted_at change, re-evaluates the join condition for all users in that organization, and removes them from the result.
This cascading visibility is handled correctly and incrementally. Only the users belonging to the soft-deleted organization are processed. Users in other organizations are untouched.
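Illustratively, with the tables above (the organization id is made up):
-- Soft-delete one organization; only its members are re-evaluated
UPDATE organizations SET deleted_at = now() WHERE id = 7;
-- after the next refresh, every user with org_id = 7 is gone from
-- active_user_directory; rows for other organizations are untouched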
Best Practices Summary
- Always filter deleted_at IS NULL in stream table definitions over soft-deletable tables. Make this a linting rule.
- Keep deleted_at in the same table as the columns your stream tables aggregate. Avoid needing a join just to check deletion status.
- Schedule tombstone purges during low-traffic periods. They generate CDC events that produce zero-deltas but still consume processing resources.
- Use column-level change detection (pg_trickle's default behavior) to avoid processing irrelevant updates to non-referenced columns.
- Test the un-delete path — verify that restoring a soft-deleted row correctly re-includes it in dependent stream tables.
- Consider validity ranges (valid_from/valid_to) for temporal data that needs time-travel semantics. Stream tables work equally well with this pattern.
Soft deletes are a schema pattern. Incremental view maintenance is an engine feature. Understanding how they interact prevents subtle correctness bugs and ensures your stream tables always reflect the intended application semantics — not the physical table contents.
← Back to Blog Index | Documentation
Spill-to-Disk and the Auto-Fallback Safety Net
What happens when your delta query exceeds work_mem — and how pg_trickle recovers
Your stream table has been running in DIFFERENTIAL mode for weeks. Refreshes take 5ms. Life is good.
Then marketing runs a campaign, and 50,000 orders land in 10 minutes. The delta query needs to join 50,000 changed rows with a million-row dimension table, aggregate the results, and apply the merge. The intermediate result set doesn't fit in work_mem. PostgreSQL spills to disk.
The refresh still completes — it just takes 2 seconds instead of 5 milliseconds. No data is lost. But if this happens every cycle, DIFFERENTIAL mode is doing more work than FULL would.
pg_trickle's spill-to-disk detection and auto-fallback handle this. When delta queries spill repeatedly, the system switches to FULL refresh until the situation stabilizes.
How Delta Queries Use Memory
A DIFFERENTIAL refresh executes a delta query that processes only the changed rows. The query plan typically includes:
- Change buffer scan: Read the changed rows from
pgtrickle_changes.changes_<oid>. - Join with current data: Join changed rows with source tables to compute the delta.
- Aggregation: Compute aggregate deltas (SUM, COUNT, etc.).
- MERGE: Apply the delta to the stream table.
Steps 2 and 3 may use hash tables, sort buffers, or other in-memory structures. These are bounded by PostgreSQL's work_mem setting (default 4MB).
When the intermediate result exceeds work_mem, PostgreSQL's executor writes the excess to temporary files on disk. This is called "spilling" or "temp file usage." It's not an error — it's the normal overflow mechanism. But disk I/O is orders of magnitude slower than memory access.
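Spilling is easy to observe outside pg_trickle as well. The following is a generic illustration (nothing pg_trickle-specific, just a sort that exceeds work_mem) where EXPLAIN reports an external merge and temp block usage:
-- A deliberately oversized sort: with work_mem = 4MB this spills, and the
-- plan shows "Sort Method: external merge  Disk: ..." plus temp block counts
SET work_mem = '4MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 5000000) AS g ORDER BY g;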
Detecting Spill
pg_trickle monitors temp_blks_written in the query execution statistics. After each DIFFERENTIAL refresh, it checks:
Did this refresh write temp blocks to disk?
This information is available from PostgreSQL's pg_stat_statements or the executor's instrumentation. pg_trickle records it in the refresh history:
SELECT
refresh_id,
refresh_mode,
duration_ms,
temp_blocks_written
FROM pgtrickle.get_refresh_history('order_summary')
ORDER BY refresh_id DESC
LIMIT 10;
refresh_id | refresh_mode | duration_ms | temp_blocks_written
------------+--------------+-------------+---------------------
1042 | DIFFERENTIAL | 5.2 | 0
1043 | DIFFERENTIAL | 5.1 | 0
1044 | DIFFERENTIAL | 2100.0 | 4096
1045 | DIFFERENTIAL | 1800.0 | 3584
1046 | FULL | 450.0 | 0
Refreshes 1044 and 1045 spilled to disk. Refresh 1046 was automatically switched to FULL.
The Spill Threshold
pg_trickle tracks consecutive spills using two GUC settings:
pg_trickle.spill_threshold_blocks (default: 1024)
The number of temp blocks written before a refresh is flagged as "spilled." Below this threshold, minor spills are ignored — they're usually caused by transient memory pressure and don't warrant a mode switch.
pg_trickle.spill_consecutive_limit (default: 3)
The number of consecutive spilled refreshes before pg_trickle switches the stream table to FULL refresh mode. Three consecutive spills is the signal that the workload has shifted and DIFFERENTIAL is no longer efficient.
The logic:
if temp_blocks_written > spill_threshold_blocks:
consecutive_spills += 1
else:
consecutive_spills = 0
if consecutive_spills >= spill_consecutive_limit:
switch to FULL refresh for this stream table
The Auto-Recovery
Switching to FULL is not permanent. pg_trickle continues monitoring the change ratio (how much data changed relative to the stream table size). When the change ratio drops below differential_max_change_ratio (default: 0.10), pg_trickle switches back to DIFFERENTIAL.
The typical sequence:
- Burst of changes → delta query spills → 3 consecutive spills → switch to FULL.
- FULL refresh handles the burst cleanly.
- Change rate returns to normal → change ratio drops below 10% → switch back to DIFFERENTIAL.
- Normal 5ms refreshes resume.
The entire cycle is automatic. No manual intervention.
Tuning merge_work_mem_mb
The most effective way to prevent spilling is to give delta queries more memory:
-- Increase merge work memory (default: auto-calculated)
SET pg_trickle.merge_work_mem_mb = 64;
This sets the work_mem specifically for pg_trickle's delta queries, independent of the global PostgreSQL work_mem setting. It prevents delta queries from competing with user queries for memory.
How to choose a value:
- Check the temp_blocks_written column in refresh history.
- Multiply by 8KB (PostgreSQL's block size) to get the spill volume.
- Set merge_work_mem_mb to at least that amount plus a 50% margin.
Example: If spills are typically 3,000 blocks (24MB), set merge_work_mem_mb = 48.
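If you'd rather derive a starting value from recorded history, a query along these lines (using the refresh-history columns shown earlier; the 50% margin is a judgment call) gives a per-table suggestion:
-- Convert the worst observed spill into a suggested merge_work_mem_mb
-- (temp blocks × 8KB, plus a 50% margin)
SELECT
    MAX(temp_blocks_written) AS max_spill_blocks,
    ROUND(MAX(temp_blocks_written) * 8 / 1024.0) AS max_spill_mb,
    CEIL(MAX(temp_blocks_written) * 8 / 1024.0 * 1.5) AS suggested_merge_work_mem_mb
FROM pgtrickle.get_refresh_history('order_summary')
WHERE temp_blocks_written > 0;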
When FULL Is Actually Better
Spilling isn't always a sign that DIFFERENTIAL mode is broken. Sometimes the delta is genuinely large enough that FULL refresh is the better strategy.
The crossover point depends on the query complexity and table size, but the general rule:
- Change ratio < 5%: DIFFERENTIAL is almost always faster, even if it spills slightly.
- Change ratio 5–15%: Gray zone. Spilling here suggests FULL might be competitive.
- Change ratio > 15%: FULL is usually faster. The delta query is processing so much data that it's approaching a full scan anyway, but with the overhead of the MERGE step.
pg_trickle's cost model considers all of this — change ratio, historical spill rates, and FULL refresh timing — when making the AUTO mode decision. The spill detection adds a corrective signal: "the cost model predicted DIFFERENTIAL would be fast, but it wasn't."
Monitoring Spill History
For a global view of spill behavior:
SELECT
st.name,
COUNT(*) FILTER (WHERE rh.temp_blocks_written > 0) AS spilled_refreshes,
COUNT(*) AS total_refreshes,
MAX(rh.temp_blocks_written) AS max_spill_blocks,
AVG(rh.duration_ms) FILTER (WHERE rh.temp_blocks_written > 0) AS avg_spill_duration_ms,
AVG(rh.duration_ms) FILTER (WHERE rh.temp_blocks_written = 0) AS avg_normal_duration_ms
FROM pgtrickle.pgt_stream_tables st
JOIN pgtrickle.get_refresh_history(st.name) rh ON true
GROUP BY st.name
HAVING COUNT(*) FILTER (WHERE rh.temp_blocks_written > 0) > 0
ORDER BY spilled_refreshes DESC;
This shows which stream tables are spilling, how often, and how much slower spilled refreshes are compared to normal ones.
Preventing Spills Proactively
Beyond increasing merge_work_mem_mb, there are structural approaches:
1. Use append_only for insert-only tables:
SELECT pgtrickle.create_stream_table(
name => 'event_counts',
query => $$ ... $$,
append_only => true
);
Append-only mode uses INSERT instead of MERGE, which requires less memory.
2. Shorten refresh intervals to keep deltas small: If a 10-second schedule accumulates 10,000 changes per cycle, a 2-second schedule accumulates 2,000. Smaller deltas are less likely to spill.
3. Add indexes to source tables on join keys: The delta query joins changed rows with source tables. Without indexes, these joins may hash-join the full table, consuming work_mem.
4. Use TRUNCATE for change buffer cleanup:
SET pg_trickle.cleanup_use_truncate = on;
TRUNCATE is faster than DELETE for cleaning up large change buffers, reducing the per-cycle overhead.
Summary
When delta queries exceed work_mem, PostgreSQL spills to disk. pg_trickle detects this via temp_blocks_written and, after consecutive spills, automatically switches to FULL refresh until the workload stabilizes.
Tuning options:
merge_work_mem_mb— give delta queries more memory.spill_threshold_blocks— ignore minor spills.spill_consecutive_limit— control how quickly the fallback triggers.
The safety net is automatic and self-healing. Bursts cause a temporary switch to FULL. Normal operations resume automatically. Your data stays correct through all of it.
← Back to Blog Index | Documentation
Why Your Materialized Views Are Always Stale
(And How to Fix It in 5 Lines of SQL)
You have a dashboard. It runs a complex query over millions of rows. Without a materialized view it takes 8 seconds. With one, it takes 12 milliseconds. You shipped the materialized view two months ago, put a REFRESH MATERIALIZED VIEW in a cron job, and declared victory.
Last week a customer asked why their newly-submitted order wasn't showing in the dashboard totals. You checked. The cron job had silently failed. The view had been stale for four days.
This is the normal lifecycle of a PostgreSQL materialized view in production. Not a horror story — just the quiet, predictable friction that accumulates when you build derived data on top of a full-scan refresh model.
Here's what that friction costs, why the standard fixes don't work at scale, and how incremental view maintenance changes the equation.
What Materialized Views Actually Do
PostgreSQL's MATERIALIZED VIEW caches the result of a query. That's it. When you write REFRESH MATERIALIZED VIEW orders_summary, PostgreSQL executes the underlying SELECT in full, writes the results to disk, and replaces the cached copy.
The critical word is full. Every refresh scans every row in every source table referenced by the query, regardless of what changed. If your orders_summary view aggregates 50 million orders and three orders were placed since the last refresh, PostgreSQL still scans all 50 million.
This is fine when:
- The underlying tables are small
- You don't care about real-time freshness (data warehouse use case)
- Refreshes run on a schedule where staleness is acceptable
It breaks down when:
- Tables grow to tens of millions of rows
- Refreshes start taking minutes
- Freshness matters (operational dashboards, real-time analytics, user-facing data)
And it fails entirely when:
- Freshness is measured in seconds, not minutes
- The cost of a full scan exceeds the cost of the query the view was meant to replace
At that point, your materialized view has stopped being a cache and started being a liability.
The Three Fixes That Don't Actually Fix It
When teams hit the staleness wall, they try variations of the same three solutions.
Fix 1: Refresh More Often
The cron job runs every hour. Make it run every minute. Make it run every 10 seconds.
The problem is that REFRESH MATERIALIZED VIEW locks the view while it refreshes. During the refresh, no queries can read from it. For a view that takes 500ms to refresh, running it every 10 seconds means it's locked 5% of the time. Run it every second and it's locked 50% of the time.
PostgreSQL has REFRESH MATERIALIZED VIEW CONCURRENTLY, which avoids blocking readers by computing the new result into a temporary table and applying only the row-level differences. It requires a unique index and typically takes longer than a plain refresh, often roughly twice as long. The staleness improves; the cost roughly doubles.
At some scale, you hit a ceiling: refreshes take longer than the interval between them. You can't run a 30-second refresh every 10 seconds.
Fix 2: Partial Refresh with Manual Change Tracking
Some teams add an updated_at column to source tables and write refresh logic that only processes rows changed since the last run.
-- "Incremental" refresh for orders summary
INSERT INTO orders_summary
SELECT customer_id, SUM(total), COUNT(*)
FROM orders
WHERE updated_at > last_refresh_time
GROUP BY customer_id
ON CONFLICT (customer_id) DO UPDATE
SET total = orders_summary.total + EXCLUDED.total,
order_count = orders_summary.order_count + EXCLUDED.order_count;
This is closer to the right idea. But it's brittle in practice:
- It only works for INSERTs. Deletes and updates require separate handling.
- The ON CONFLICT delta logic is hand-coded and error-prone. If the same customer_id appears in two separate change windows, you can double-count.
- updated_at columns need to be maintained on every source table. Missing one breaks the logic.
- Multi-table JOINs make the "what changed" tracking exponentially harder. If orders joins to customers and a customer's region changes, you need to recompute that customer's regional summary even though no orders changed.
This approach works at small scale. Teams rebuild it correctly once, and then spend the next year patching edge cases.
Fix 3: Move to a Different System
Elasticsearch, ClickHouse, Apache Flink, Materialize. These systems have proper incremental processing. They work. They also mean you're now running two data stores, with all the synchronization and consistency problems that entails.
For teams that don't need PostgreSQL-native queries, this is a legitimate choice. For the rest, it's the infrastructure equivalent of moving to a bigger house because you can't find a good plumber.
What Incremental View Maintenance Actually Means
The core insight behind IVM is simple: for most queries, you don't need to recompute the full result when an input changes. You need to compute the delta and apply it.
For SUM(total):
- Row inserted with total = 150: new sum = old sum + 150
- Row deleted with total = 150: new sum = old sum - 150
- Row updated from total = 150 to total = 200: new sum = old sum + 50
For COUNT(*):
- Row inserted: new count = old count + 1
- Row deleted: new count = old count - 1
These are O(1) operations. They don't depend on the size of the table. A table with 50 million rows and a table with 50 rows update in the same time if the delta has one row.
The challenge is that this algebraic property — the ability to express "how does the result change?" as a closed-form operation — doesn't hold for every query. Some aggregates (like MEDIAN) don't have a simple inverse. Some window functions require seeing the full neighborhood. But the aggregates that matter most in practice — SUM, COUNT, AVG, MIN/MAX with caveats, and now vector averages — all support it.
pg_trickle: IVM as a PostgreSQL Extension
pg_trickle implements IVM inside PostgreSQL using a combination of trigger-based CDC and a differential dataflow engine.
The workflow is different from REFRESH MATERIALIZED VIEW. Instead of defining how to recompute, you define what you want maintained:
SELECT pgtrickle.create_stream_table(
name => 'orders_summary',
query => $$
SELECT
c.region,
date_trunc('day', o.created_at) AS order_date,
SUM(o.total) AS revenue,
COUNT(*) AS order_count,
AVG(o.total) AS avg_order_value
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region, date_trunc('day', o.created_at)
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
That's 5 lines. orders_summary is now a real table in your schema, queryable like any other table, with an HNSW or B-tree index if you want one.
When an order is inserted, pg_trickle's CDC triggers capture the change into a buffer. Every 5 seconds (or whatever interval you configure), the background worker drains the buffer, computes the delta using the DVM engine, and applies it to orders_summary. Only the affected rows change.
If 3 orders are placed in the europe region on a Monday, only one row in orders_summary is updated — the (europe, 2026-04-27) aggregate. The other 365×N region/day rows are untouched.
What Changes
Staleness
Your view is at most schedule seconds stale, continuously, without a cron job. The default schedule is 5 seconds. You can set it to 1 second for high-frequency data.
SELECT pgtrickle.alter_stream_table('orders_summary', schedule => '1 second');
There's no drift: every refresh cycle covers exactly the changes since the last one. There's no manual "catch-up" after a failure — the change buffers simply accumulate until the worker processes them, and the result is always consistent.
Cost
A refresh cycle that changes 10 rows costs roughly the same whether your source tables have 1 million rows or 1 billion. The differential engine processes deltas, not the full dataset. The cost scales with the number of changes per cycle, not the size of the data.
For a table that's updated continuously by a high-write workload, expect a refresh cycle to take 10–100ms depending on the number of changed rows and the complexity of the query.
Correctness
The CDC triggers run inside the same transaction as the change that caused them. If the original transaction rolls back, the change buffer entry is also rolled back. This means the view never reflects changes from transactions that didn't commit — a property that's surprisingly hard to guarantee with external batch pipelines.
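A small illustration of that property (the column list on orders and the customer id are assumed for the example):
-- Changes from a rolled-back transaction never reach the stream table
BEGIN;
INSERT INTO orders (customer_id, total, created_at) VALUES (42, 99.00, now());
ROLLBACK;
-- The CDC trigger ran inside the same transaction, so its change-buffer
-- entry was rolled back with it; the next refresh has nothing to apply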
The JOIN Case: Where Batch Refreshes Fall Apart
The hardest case for batch incremental refresh is multi-table JOINs. Consider:
SELECT
c.region,
SUM(o.total) AS regional_revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region;
If a customer changes region — customer_id = 42 moves from europe to north_america — then:
- Every order for customer_id = 42 needs to move from europe to north_america in the aggregate.
- This requires knowing the orders for that customer.
- It also requires recomputing both the old and new regional aggregates.
A batch job keyed on orders.updated_at won't see this change at all — no orders were updated. A batch job keyed on customers.updated_at will see the customer record, but needs to then find and reprocess all their orders.
This is the join-delta problem, and it's why manual incremental refresh usually breaks down to "recompute everything for affected keys" — which is effectively a full-scan refresh for busy customers.
pg_trickle handles this through the DVM engine's join-delta rules. When customers row changes, the engine identifies the set of affected join keys, retrieves the relevant order aggregates using index lookups, and applies the correct delta. No full scan.
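The practical corollary is the same as for ordinary query tuning: keep an index on the join key so those per-key lookups stay cheap. For the example schema, that might be:
-- Index the join key so delta lookups for a changed customer don't scan all orders
CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders (customer_id);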
When to Stick with REFRESH MATERIALIZED VIEW
pg_trickle isn't the right tool for every materialized view. Specifically:
- Reporting/warehouse views: If you run REFRESH MATERIALIZED VIEW once a day for a BI dashboard and daily staleness is acceptable, that's fine. The simplicity of the built-in mechanism beats the operational overhead of a new extension.
- Very complex queries: IVM requires that the query be expressible as a composition of differentiable operators. Some queries — particularly those built on non-invertible aggregates (MEDIAN, percentiles) or window functions that need the whole partition — require a full refresh even in pg_trickle. The extension is transparent about this: create the stream table with refresh_mode => 'FULL' for queries that need it, and DIFFERENTIAL for the ones that don't.
- One-time or infrequent data: If your source data is loaded in bulk once a day and doesn't change between loads, IVM overhead isn't justified. Use REFRESH MATERIALIZED VIEW after the bulk load.
The sweet spot for pg_trickle is operational data that changes continuously — orders, events, user actions, metrics, search corpora — where staleness causes user-visible problems and full-scan refreshes are either too slow or too expensive.
A Concrete Example: A Support Team Dashboard
This is the kind of use case where the before/after is clean.
Before:
-- Ran as a cron job every 5 minutes
REFRESH MATERIALIZED VIEW CONCURRENTLY support_metrics;
-- Takes 45 seconds, locks the table for 90 seconds on busy days
-- Staleness: up to 5 minutes plus 45 seconds
After:
SELECT pgtrickle.create_stream_table(
name => 'support_metrics',
query => $$
SELECT
t.team_id,
t.name AS team_name,
COUNT(CASE WHEN tk.status = 'open' THEN 1 END) AS open_tickets,
COUNT(CASE WHEN tk.status = 'urgent' THEN 1 END) AS urgent_tickets,
AVG(EXTRACT(EPOCH FROM (tk.resolved_at - tk.created_at)) / 3600)
FILTER (WHERE tk.resolved_at IS NOT NULL) AS avg_resolution_hours,
MAX(tk.created_at) AS latest_ticket_at
FROM teams t
JOIN tickets tk ON tk.team_id = t.id
GROUP BY t.team_id, t.name
$$,
schedule => '3 seconds',
refresh_mode => 'DIFFERENTIAL'
);
Staleness: 3 seconds, maximum. Cost per refresh cycle: proportional to the number of new/changed tickets since last cycle — usually a handful. A dashboard page load goes from "wait 50ms for the MV query, which is sometimes stale" to "wait 2ms for the stream table read, always current."
The 5 Lines
SELECT pgtrickle.create_stream_table(
name => 'your_summary',
query => $$ /* your existing MV query here */ $$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
If the DVM engine can't differentiate your query, it tells you at creation time. Fix it (usually by removing a DISTINCT or rewriting a complex aggregate), or set refresh_mode => 'FULL' and schedule it as aggressively as the full-scan cost allows. At least the schedule is then managed, monitored, and correct.
The cron job goes away. The staleness alert goes away. The view is just always fresh.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Stop Rebuilding Your Search Index at 3am
pg_trickle's scheduler, SLA tiers, and how to tune refresh for your workload
If you're running periodic REFRESH MATERIALIZED VIEW or custom batch jobs to keep derived data fresh, you've made a decision — probably implicitly — about when to do that work and how to prioritize it.
Most teams make this decision once ("run it every 15 minutes at :00 and :15") and never revisit it. The result is a cron job that runs at 3am, takes 40 minutes on a big table, and occasionally conflicts with the morning ETL load at 7am. The on-call rotation includes "check if the refresh job finished before the standup."
pg_trickle's scheduler is designed to replace that mental model. This post explains how it works, what the tuning knobs are, and how to use SLA tiers to serve different workloads correctly.
How the Scheduler Works
pg_trickle runs a background worker — a PostgreSQL background process that starts with the database and persists for its lifetime. The scheduler is one of the background worker's responsibilities.
At each tick, the scheduler checks all registered stream tables and determines which ones are due for a refresh. "Due" is determined by:
- The stream table's schedule — how often it should refresh
- Whether the change buffer has data to process
- The current resource usage (CPU, I/O) — controlled by backpressure GUCs
- The SLA tier's priority rules
The scheduler doesn't run all refreshes at exactly their scheduled time. It maintains a priority queue and processes refreshes in priority order, subject to concurrency limits. High-priority stream tables preempt lower-priority ones when they're due.
The schedule Parameter
The simplest knob:
SELECT pgtrickle.create_stream_table(
name => 'live_orders',
query => $$ ... $$,
schedule => '5 seconds' -- refresh up to every 5 seconds
);
SELECT pgtrickle.create_stream_table(
name => 'daily_summary',
query => $$ ... $$,
schedule => '1 hour' -- refresh up to every hour
);
The schedule is a maximum interval, not a fixed interval. If the change buffer is empty (nothing changed), the refresh is skipped. This means a stream table that says "refresh every 5 seconds" but whose source data hasn't changed will actually skip most cycles.
This is important for cost. An empty refresh cycle costs almost nothing — just a check of the change buffer. You can set aggressive schedules on stream tables that are usually quiet without paying a significant overhead for the quiet periods.
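You can check which stream tables would skip their next cycle right now, using the status function shown in the monitoring section below (a sketch; only tables with pending changes actually do work):
SELECT name, schedule, pending_change_rows
FROM pgtrickle.stream_table_status()
WHERE pending_change_rows = 0;   -- these refreshes will be skipped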
SLA Tiers
Different stream tables have different importance. A user-facing search corpus that drives revenue needs different treatment from a background analytics table used by an internal reporting dashboard.
pg_trickle's SLA tiers let you declare this importance explicitly:
SELECT pgtrickle.alter_stream_table(
'live_product_search',
sla_tier => 'critical'
);
SELECT pgtrickle.alter_stream_table(
'daily_revenue_summary',
sla_tier => 'standard'
);
SELECT pgtrickle.alter_stream_table(
'historical_analytics',
sla_tier => 'background'
);
Three built-in tiers, in descending priority:
| Tier | Priority | When to use |
|---|---|---|
| critical | Highest | User-facing, latency-sensitive. Preempts everything. |
| standard | Default | Normal operational data. Runs when resources allow. |
| background | Lowest | Batch analytics, archives. Runs in idle time. |
The scheduler always processes critical refreshes before standard, and standard before background. Under load — high write volume, many stream tables competing for refresh bandwidth — background tables may be delayed significantly, but critical tables are always processed first.
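A quick way to sanity-check that the tiers behave as intended is to look at worst-case staleness per tier (a sketch using the status function shown in the monitoring section below):
SELECT
  sla_tier,
  COUNT(*) AS stream_tables,
  MAX(EXTRACT(EPOCH FROM (NOW() - last_refresh_at)))::int AS worst_staleness_secs
FROM pgtrickle.stream_table_status()
GROUP BY sla_tier
ORDER BY worst_staleness_secs DESC;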
Concurrency: How Many Refreshes Run in Parallel
By default, pg_trickle runs two parallel refresh workers. You can tune this:
-- In postgresql.conf or SET:
pg_trickle.max_parallel_workers = 4
With 4 workers, up to 4 stream tables can refresh simultaneously. The scheduler assigns refreshes to idle workers and respects the SLA priority ordering.
Be careful not to set this too high. Each refresh worker opens a PostgreSQL connection and may perform index lookups, writes, and MERGE statements on stream tables. Too many concurrent refreshes can create I/O saturation or lock contention on heavily shared tables.
For most workloads, 2–4 workers is appropriate. For systems with many independent stream tables (tens to hundreds) and fast storage, more workers may help.
Backpressure: Preventing Refresh from Overwhelming Your Database
The harder scheduling problem is protecting your database from refresh overhead during high-write periods.
When source tables are being written very quickly — a bulk import, a traffic spike, a viral event — the change buffers fill up rapidly. If pg_trickle tries to process all those changes immediately, the refresh workers compete with the application for I/O bandwidth and lock resources.
The backpressure mechanism limits this:
-- In postgresql.conf:
pg_trickle.backpressure_enabled = on
pg_trickle.backpressure_max_lag_mb = 128 -- pause if WAL lag exceeds 128MB
pg_trickle.backpressure_lag_check_interval = 5s -- check every 5 seconds
When WAL lag exceeds the threshold — a sign that the database is under write load — the scheduler pauses background-tier refreshes and slows standard-tier refreshes. Critical-tier refreshes continue at full speed.
The effect: during a bulk import or traffic spike, your user-facing search corpora stay fresh (critical tier). Your analytics summaries fall slightly behind (standard tier slows). Your background reports accumulate a backlog that drains after the spike passes. The application never sees resource contention from the refresh workers.
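If you want to eyeball the WAL lag signal yourself, the standard PostgreSQL catalogs work; this sketch assumes pg_trickle's CDC holds a logical replication slot (which slot belongs to pg_trickle depends on your setup):
-- Per-slot WAL retention: how far behind each slot's restart point is
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;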
Monitoring the Scheduler
The simplest view:
SELECT
name,
sla_tier,
schedule,
last_refresh_at,
EXTRACT(EPOCH FROM (NOW() - last_refresh_at))::int AS staleness_secs,
rows_changed_last_cycle,
avg_refresh_ms,
pending_change_rows
FROM pgtrickle.stream_table_status()
ORDER BY staleness_secs DESC;
When something's off:
-- Which tables have fallen behind their schedule?
SELECT name, schedule, staleness_secs
FROM pgtrickle.stream_table_status()
WHERE staleness_secs > EXTRACT(EPOCH FROM schedule::interval) * 3;
-- Is the change buffer accumulating a backlog?
SELECT source_table, pending_rows, oldest_change_at,
EXTRACT(EPOCH FROM (NOW() - oldest_change_at))::int AS backlog_age_secs
FROM pgtrickle.change_buffer_status()
ORDER BY pending_rows DESC;
A high backlog_age_secs on a source table means the scheduler is behind. This happens during traffic spikes (expected) or when the refresh is taking too long (investigate).
Diagnosing Slow Refreshes
When a stream table's refresh is consistently slower than expected:
-- Refresh history with timing breakdown
SELECT
refreshed_at,
duration_ms,
rows_changed,
delta_compute_ms,
merge_apply_ms,
index_update_ms
FROM pgtrickle.refresh_history('slow_stream_table')
ORDER BY refreshed_at DESC
LIMIT 20;
The three timing components:
- delta_compute_ms: Time to compute the delta (join lookups, aggregate updates)
- merge_apply_ms: Time to apply the delta to the stream table via MERGE
- index_update_ms: Time for index maintenance after the MERGE
If delta_compute_ms is high, look at missing indexes on source tables' join columns. If merge_apply_ms is high, the delta is large (many rows changed in one cycle) — consider a more frequent schedule to keep deltas smaller. If index_update_ms is high, the stream table has expensive indexes (large HNSW, many B-tree indexes) — this is expected and should be factored into your schedule.
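For example, if the support_metrics table from the earlier post showed a high delta_compute_ms, the usual first fix is an index on the join column the delta lookups use (illustrative; pick the column from your own query):
-- Lets the delta computation find a team's tickets by index lookup
CREATE INDEX CONCURRENTLY IF NOT EXISTS tickets_team_id_idx
    ON tickets (team_id);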
A Real Tuning Example
Starting configuration, before tuning:
- 15 stream tables, all with default schedule = '1 minute'
- 2 refresh workers
- No SLA tiers set
- User complaints: search results sometimes 3–5 minutes stale during peak hours
Post-tuning:
-- User-facing search: critical tier, aggressive schedule
SELECT pgtrickle.alter_stream_table(
'product_search', sla_tier => 'critical', schedule => '5 seconds'
);
-- Operational dashboards: standard tier
SELECT pgtrickle.alter_stream_table(
'support_metrics', sla_tier => 'standard', schedule => '15 seconds'
);
SELECT pgtrickle.alter_stream_table(
'inventory_status', sla_tier => 'standard', schedule => '30 seconds'
);
-- Analytics: background tier, let them be a few minutes stale
SELECT pgtrickle.alter_stream_table(
'daily_revenue', sla_tier => 'background', schedule => '5 minutes'
);
SELECT pgtrickle.alter_stream_table(
'historical_funnel', sla_tier => 'background', schedule => '15 minutes'
);
-- Increase workers to handle the volume
ALTER SYSTEM SET pg_trickle.max_parallel_workers = 4;
SELECT pg_reload_conf();
-- Enable backpressure for bulk-import protection
ALTER SYSTEM SET pg_trickle.backpressure_enabled = on;
ALTER SYSTEM SET pg_trickle.backpressure_max_lag_mb = 64;
SELECT pg_reload_conf();
Result: product_search is now refreshed every 5 seconds with critical priority. During peak load, it maintains that cadence while background analytics tables fall a few minutes behind. User complaints about stale search results: zero.
The 3am Problem
The reason index rebuilds and MV refreshes happen at 3am is that they're too expensive to run during the day. The solution is to make the per-cycle cost small enough that it doesn't matter when it runs.
Differential refreshes with small deltas accomplish this. A 5-second cycle that processes 50 changed rows takes 10–20ms. You can run that continuously, around the clock, and it's invisible in your I/O metrics.
The 3am batch window is a symptom of using the wrong refresh model — full scans on a schedule. When you move to incremental, the concept of "a time when it's safe to refresh" goes away. The work is continuous and small, not periodic and large.
This also means the index is always current. Not "as of midnight last night" current. Not "as of the last time someone remembered to kick off the job" current. Always current, within the configured schedule.
The 3am maintenance window goes away. The on-call step disappears from the runbook. The cron job gets deleted. The monitoring alert for "did the job finish before market open" — gone.
That's the real value proposition of continuous incremental maintenance. Not just speed. The elimination of an entire class of operational work.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Streaming to Kafka Without Kafka Expertise
How pgtrickle-relay bridges stream table deltas to external systems
You have pg_trickle maintaining stream tables in PostgreSQL. Now your analytics team wants those deltas in Kafka. Your mobile team wants webhook notifications. Your data science team wants events in NATS for a real-time ML feature store.
The traditional approach: write a Kafka producer in Python, poll the database, serialize the deltas, handle offsets, deal with exactly-once semantics, set up monitoring, and maintain it forever.
The pg_trickle approach: run a single binary.
What pgtrickle-relay Does
pgtrickle-relay is a standalone Rust binary that reads from pg_trickle's outbox tables and writes to external messaging systems. It's not a library, not a framework, not an SDK. It's a process you run next to your PostgreSQL instance.
PostgreSQL                                    External Systems

┌──────────────────┐                          ┌──────────────┐
│  stream table    │                      ┌──→│ Kafka topic  │
│     ↓ delta      │                      │   └──────────────┘
│  outbox table    │──→ pgtrickle-relay ──┤   ┌──────────────┐
│  (transactional) │                      ├──→│ NATS subject │
└──────────────────┘                      │   └──────────────┘
                                          │   ┌──────────────┐
                                          └──→│ HTTP webhook │
                                              └──────────────┘
Supported sinks: Kafka, NATS JetStream, SQS, RabbitMQ, Redis Streams, HTTP webhooks.
Supported sources (for the inbox direction): the same list, reversed. External events come in, land in an inbox table, and stream tables can read from them.
Setup
1. Enable the outbox on a stream table
SELECT pgtrickle.enable_outbox('revenue_by_region');
This creates an outbox table that captures every delta produced by revenue_by_region. Each refresh cycle that produces non-empty changes writes an outbox row in the same transaction as the MERGE.
2. Configure the relay
# relay.toml
[global]
postgres_url = "postgres://user:pass@localhost/mydb"
metrics_bind = "0.0.0.0:9090"
[[pipeline]]
name = "revenue-to-kafka"
source = { type = "outbox", stream_table = "revenue_by_region" }
sink = { type = "kafka", brokers = "kafka:9092", topic = "revenue-deltas" }
[[pipeline]]
name = "orders-to-nats"
source = { type = "outbox", stream_table = "order_view" }
sink = { type = "nats", url = "nats://localhost:4222", subject = "orders.>" }
[[pipeline]]
name = "alerts-to-webhook"
source = { type = "outbox", stream_table = "fraud_alerts" }
sink = { type = "webhook", url = "https://hooks.example.com/fraud", method = "POST" }
3. Run it
pgtrickle-relay --config relay.toml
That's it. No Kafka consumer code. No NATS client library. No webhook retry logic.
What Gets Sent
Each outbox message is a JSON envelope containing:
{
"stream_table": "revenue_by_region",
"sequence": 42,
"timestamp": "2026-04-27T10:15:03.412Z",
"delta": {
"inserted": [
{"region": "europe", "day": "2026-04-27", "revenue": 150200.50, "order_count": 1203}
],
"deleted": [
{"region": "europe", "day": "2026-04-27", "revenue": 149800.00, "order_count": 1201}
]
},
"metadata": {
"refresh_mode": "DIFFERENTIAL",
"refresh_duration_ms": 12,
"delta_rows": 2
}
}
The delta is the exact set of rows that changed in the stream table — rows removed (old values) and rows added (new values). For an aggregate that was updated, the old aggregate value is in deleted and the new value is in inserted.
Consumers don't need to know about pg_trickle, PostgreSQL, or IVM. They receive a JSON message with the before/after state of the affected rows. They can build their own materialization from the delta stream.
Subject Routing
For NATS and Kafka, you can route messages to different subjects/topics based on the delta content:
[[pipeline]]
name = "regional-revenue"
source = { type = "outbox", stream_table = "revenue_by_region" }
sink = {
type = "nats",
url = "nats://localhost:4222",
subject_template = "revenue.{{ region }}"
}
A delta for the europe region goes to revenue.europe. A delta for asia goes to revenue.asia. Consumers subscribe to only the regions they care about.
High Availability
In production you run multiple relay instances. They coordinate via PostgreSQL advisory locks:
[global]
ha_group = "relay-primary"
One instance acquires the advisory lock and becomes the active leader. The others are standby. If the leader crashes, a standby acquires the lock within seconds and continues from where the leader left off.
The outbox table stores the consumer offset. A new leader reads the last committed offset and resumes. No messages are lost. Some messages may be delivered twice during failover — consumers should be idempotent (which they should be anyway).
The Inbox Direction
The relay also works in reverse. External events can flow into PostgreSQL:
[[pipeline]]
name = "payments-from-kafka"
source = { type = "kafka", brokers = "kafka:9092", topic = "payment-events", group_id = "pgtrickle-inbox" }
sink = { type = "inbox", inbox_name = "payment_events" }
Events land in pgtrickle.inbox_payment_events. You can build stream tables on top of the inbox:
-- Create the inbox
SELECT pgtrickle.create_inbox('payment_events');
-- Stream table over incoming payment events
SELECT pgtrickle.create_stream_table(
'payment_summary',
$$SELECT
customer_id,
SUM((payload->>'amount')::numeric) AS total_paid,
COUNT(*) AS payment_count
FROM pgtrickle.inbox_payment_events
WHERE processed = false
GROUP BY customer_id$$,
schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);
Events from Kafka are now queryable as a PostgreSQL table, with incremental aggregation on top.
Monitoring
The relay exposes Prometheus metrics at /metrics:
# Messages delivered successfully
pgtrickle_relay_messages_delivered_total{pipeline="revenue-to-kafka"} 12847
# Delivery latency (histogram)
pgtrickle_relay_delivery_duration_seconds_bucket{pipeline="revenue-to-kafka",le="0.01"} 12500
# Consumer lag (messages pending)
pgtrickle_relay_consumer_lag{pipeline="revenue-to-kafka"} 3
# Errors
pgtrickle_relay_delivery_errors_total{pipeline="revenue-to-kafka"} 0
And a health endpoint at /health that returns the status of each pipeline.
When to Use the Relay vs. Direct Outbox Polling
If your consumer is a PostgreSQL-native application (another service that queries the database), use pgtrickle.poll_outbox() directly. No relay needed.
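Direct polling looks roughly like this (a sketch; check the documentation for the exact poll_outbox() signature and the acknowledgement mechanism in your version):
-- Fetch any outbox messages produced since the consumer last polled
SELECT *
FROM pgtrickle.poll_outbox('revenue_by_region');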
If your consumer is an external system that speaks Kafka, NATS, HTTP, or any other messaging protocol — use the relay. It handles serialization, delivery, retries, offset tracking, and HA. Writing a custom consumer for each sink is the kind of infrastructure work that seems small and grows into a maintenance burden.
The relay is also useful when you need fan-out: one stream table's deltas going to multiple sinks. Each pipeline runs independently, with its own offset tracking.
← Back to Blog Index | Documentation
Structured Logging and OpenTelemetry for Stream Tables
When grep isn't enough: JSON events, correlation IDs, and observability integration
Your stream table failed. The log says:
ERROR: stream table 'order_summary' refresh failed: division by zero
OK. But which refresh cycle? What was the change ratio? How long did it run before failing? Was it DIFFERENTIAL or FULL? Which source tables had changes?
With pg_trickle.log_format = text (the default), these questions require correlating multiple log lines by timestamp, hoping they're adjacent, and parsing free-form text.
With pg_trickle.log_format = json, every log event is a structured JSON object with consistent fields. Pipe it to Loki, Datadog, Elasticsearch, or any log aggregator, and every question has an indexed answer.
Enabling JSON Logging
-- In postgresql.conf or via ALTER SYSTEM
ALTER SYSTEM SET pg_trickle.log_format = 'json';
SELECT pg_reload_conf();
After reload, pg_trickle's log output changes from:
LOG: pg_trickle: refreshing 'order_summary' (DIFFERENTIAL, 42 changes)
LOG: pg_trickle: refresh complete 'order_summary' (4.7ms, 3 rows changed)
To:
{"event":"refresh_start","pgt_name":"order_summary","pgt_id":17,"cycle_id":"c-20260427-101500-017","refresh_mode":"DIFFERENTIAL","changes_pending":42,"ts":"2026-04-27T10:15:00.123Z"}
{"event":"refresh_complete","pgt_name":"order_summary","pgt_id":17,"cycle_id":"c-20260427-101500-017","refresh_mode":"DIFFERENTIAL","duration_ms":4.7,"rows_inserted":2,"rows_updated":1,"rows_deleted":0,"ts":"2026-04-27T10:15:00.128Z"}
Event Taxonomy
pg_trickle emits structured events for all major operations:
| Event | When | Key Fields |
|---|---|---|
| refresh_start | Refresh begins | pgt_name, cycle_id, refresh_mode, changes_pending |
| refresh_complete | Refresh succeeds | duration_ms, rows_inserted/updated/deleted |
| refresh_error | Refresh fails | error_code, error_message, duration_ms |
| mode_fallback | DIFFERENTIAL → FULL | reason, change_ratio, threshold |
| cdc_transition | Trigger → WAL (or back) | direction, source_table, reason |
| scheduler_cycle | Scheduler wakes | tables_checked, tables_refreshed, loop_duration_ms |
| worker_dispatch | Worker assigned to refresh | pgt_name, worker_id, database |
| scc_converge | Cycle converges | scc_id, iterations, tables |
| scc_timeout | Cycle hits max iterations | scc_id, iterations, remaining_changes |
| drain_start | Drain mode entered | inflight_count |
| drain_complete | Drain finished | drain_duration_ms |
| cache_evict | L0 cache entry evicted | pgt_id, reason, cache_size |
| spill_detected | Delta query spilled to disk | pgt_name, temp_blocks, consecutive_count |
| backpressure_engaged | WAL slot lag exceeded | source_table, lag_bytes, threshold |
| backpressure_released | WAL slot lag recovered | source_table, lag_bytes |
Correlation via cycle_id
Every refresh cycle gets a unique cycle_id. All events within that cycle — start, complete/error, mode fallback, spill detection — share the same cycle_id.
This lets you trace the full lifecycle of a single refresh:
# In Loki/Grafana
{job="pg_trickle"} | json | cycle_id="c-20260427-101500-017"
{"event":"refresh_start","cycle_id":"c-20260427-101500-017","refresh_mode":"DIFFERENTIAL",...}
{"event":"spill_detected","cycle_id":"c-20260427-101500-017","temp_blocks":2048,...}
{"event":"mode_fallback","cycle_id":"c-20260427-101500-017","reason":"spill_limit",...}
{"event":"refresh_complete","cycle_id":"c-20260427-101500-017","refresh_mode":"FULL","duration_ms":450,...}
One cycle_id tells the full story: the refresh started as DIFFERENTIAL, spilled to disk, fell back to FULL, and completed in 450ms.
Integration with Log Aggregators
Loki (via promtail)
# promtail config
scrape_configs:
- job_name: pg_trickle
static_configs:
- targets: [localhost]
labels:
job: pg_trickle
__path__: /var/log/postgresql/postgresql-*.log
pipeline_stages:
- match:
selector: '{job="pg_trickle"}'
stages:
- json:
expressions:
event: event
pgt_name: pgt_name
cycle_id: cycle_id
duration_ms: duration_ms
- labels:
event:
pgt_name:
Datadog
# datadog agent config
logs:
- type: file
path: /var/log/postgresql/postgresql-*.log
service: pg_trickle
source: postgresql
log_processing_rules:
- type: multi_line
name: pg_trickle_json
pattern: '^\{"event":'
Elasticsearch
// Filebeat config
{
"filebeat.inputs": [{
"type": "log",
"paths": ["/var/log/postgresql/postgresql-*.log"],
"json.keys_under_root": true,
"json.add_error_key": true,
"fields": {"service": "pg_trickle"}
}]
}
Useful Queries
Once events are in your log aggregator, common queries:
Slow refreshes (>1 second):
{job="pg_trickle"} | json | event="refresh_complete" | duration_ms > 1000
Failed refreshes:
{job="pg_trickle"} | json | event="refresh_error"
Mode fallbacks (DIFFERENTIAL → FULL):
{job="pg_trickle"} | json | event="mode_fallback" | reason != ""
Refresh frequency by stream table:
count_over_time({job="pg_trickle"} | json | event="refresh_complete" [5m]) by (pgt_name)
P99 refresh duration over time:
quantile_over_time(0.99, {job="pg_trickle"} | json | event="refresh_complete" | unwrap duration_ms [5m]) by (pgt_name)
Text Mode: Still the Default
JSON logging is opt-in. The default text format is human-readable and works fine for:
- Development and local testing
- Small deployments where you tail -f the logs
- Environments without a log aggregator
Switch to JSON when you need:
- Structured querying across thousands of refresh events
- Alerting on specific event types
- Correlation across refresh cycles
- Integration with observability platforms (Grafana, Datadog, Splunk)
OpenTelemetry Compatibility
The JSON format is designed to be compatible with OpenTelemetry's log data model. The ts field uses ISO 8601 timestamps. The event field maps to the OTel event name. Custom fields map to OTel attributes.
If you're using the OpenTelemetry Collector, you can ingest pg_trickle's JSON logs directly:
# otel-collector config
receivers:
filelog:
include: [/var/log/postgresql/*.log]
operators:
- type: json_parser
timestamp:
parse_from: attributes.ts
layout: '%Y-%m-%dT%H:%M:%S.%LZ'
exporters:
otlp:
endpoint: "tempo:4317"
This feeds pg_trickle events into your tracing backend alongside application traces. The cycle_id can be used as a span ID for correlation.
Summary
pg_trickle.log_format = json turns log output into structured, queryable events. Every refresh cycle gets a cycle_id for end-to-end correlation. Events cover the full lifecycle: refresh start/complete/error, mode fallbacks, CDC transitions, spill detection, backpressure, and scheduler cycles.
Pipe to Loki, Datadog, or Elasticsearch for structured querying. Use the event taxonomy to build dashboards and alerts. Use cycle_id to trace individual refresh cycles from start to finish.
The default text format is fine for development. Switch to JSON when you need real observability.
← Back to Blog Index | Documentation
Temporal Stream Tables: Time-Windowed Views That Update Themselves
Handling the "last 7 days" problem without cron
Here's a question that breaks most incremental view maintenance systems:
SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= now() - interval '7 days'
GROUP BY region;
The problem isn't the query. The problem is that the result changes over time even when no data changes. At midnight, yesterday's orders cross the 7-day boundary and fall out of the window. The aggregate changes — not because a row was inserted or deleted, but because time passed.
A trigger-based CDC system doesn't see this. Nothing was written to the orders table. No trigger fired. The stream table is stale, and nobody told it.
This is the temporal IVM problem. pg_trickle solves it with temporal stream tables.
What Goes Wrong Without Temporal Awareness
Consider a stream table for "revenue in the last 24 hours":
SELECT pgtrickle.create_stream_table(
'revenue_last_24h',
$$SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= now() - interval '24 hours'
GROUP BY region$$,
schedule => '5s', refresh_mode => 'DIFFERENTIAL'
);
At 3:00 PM, this correctly shows revenue from 3:00 PM yesterday to now.
At 3:05 PM, if no new orders came in, the change buffer is empty. The scheduler says "nothing to do" and skips the refresh. But the correct result changed — orders from 3:00–3:05 PM yesterday should have fallen out of the window.
By midnight, the stream table might be showing "last 24 hours" revenue that actually includes data from 36 hours ago. The longer you go without an insert, the more stale the temporal window becomes.
The brute-force fix is to run a full refresh every cycle regardless of the change buffer. But that defeats the purpose of incremental maintenance — you're scanning the entire table every 5 seconds.
How Temporal Stream Tables Work
pg_trickle's temporal mode adds a time-based eviction step to the refresh cycle:
SELECT pgtrickle.create_stream_table(
'revenue_last_24h',
$$SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= now() - interval '24 hours'
GROUP BY region$$,
schedule => '5s',
refresh_mode => 'DIFFERENTIAL',
temporal_mode => 'sliding_window'
);
With temporal_mode => 'sliding_window', the refresh cycle does two things:
- Process the change buffer — normal differential maintenance for new/updated/deleted rows.
- Evict expired rows — identify rows in the stream table whose source data has fallen outside the time window, and compute the aggregate delta from removing them.
The eviction step doesn't scan the entire source table. It uses the stream table's own data and the known window boundary to identify expired contributions.
The Eviction Delta
For a SUM(amount) grouped by region:
At 3:05 PM, the window boundary is 3:05 PM yesterday. Orders from 3:00–3:05 PM yesterday need to be subtracted from the aggregate. The eviction step:
- Identifies the source rows whose created_at falls between the old boundary (3:00 PM yesterday) and the new boundary (3:05 PM yesterday).
- Computes the delta: subtract those orders' amounts from their respective region groups.
- Applies the eviction delta alongside any change-buffer delta.
This is efficient: the eviction only processes the thin slice of data that crossed the window boundary since the last refresh. For a 5-second refresh cycle, that's 5 seconds' worth of source data to evict — typically a small number of rows.
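In SQL terms, the eviction delta for the 24-hour example is roughly the following (a conceptual sketch, not the engine's internal query; an index on orders(created_at) keeps it to an index range scan):
-- Rows that crossed the boundary between the previous refresh (3:00 PM)
-- and now (3:05 PM) are subtracted from their groups.
SELECT region,
       -SUM(amount) AS revenue_delta
FROM orders
WHERE created_at >= now() - interval '24 hours 5 minutes'  -- old boundary
  AND created_at <  now() - interval '24 hours'            -- new boundary
GROUP BY region;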
Use Cases
Rolling 7-Day Revenue
SELECT pgtrickle.create_stream_table(
'revenue_7d_rolling',
$$SELECT
region,
date_trunc('hour', created_at) AS hour,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
WHERE created_at >= now() - interval '7 days'
GROUP BY region, date_trunc('hour', created_at)$$,
schedule => '10s',
refresh_mode => 'DIFFERENTIAL',
temporal_mode => 'sliding_window'
);
Every 10 seconds, this stream table:
- Adds aggregates from new orders.
- Removes aggregates from orders that just passed the 7-day mark.
Active Users (Last 15 Minutes)
SELECT pgtrickle.create_stream_table(
'active_users',
$$SELECT
COUNT(DISTINCT user_id) AS active_count,
COUNT(*) AS event_count
FROM user_events
WHERE occurred_at >= now() - interval '15 minutes'$$,
schedule => '2s',
refresh_mode => 'DIFFERENTIAL',
temporal_mode => 'sliding_window'
);
This gives you a live "active users in the last 15 minutes" count that updates every 2 seconds. At 10:15:02, it counts events from 10:00:02 to 10:15:02. At 10:15:04, it counts events from 10:00:04 to 10:15:04. The window slides continuously.
SLA Monitoring (Last Hour)
SELECT pgtrickle.create_stream_table(
'sla_compliance',
$$SELECT
service_name,
COUNT(*) AS total_requests,
COUNT(*) FILTER (WHERE response_time_ms > 500) AS slow_requests,
COUNT(*) FILTER (WHERE response_time_ms > 500)::float / COUNT(*)::float AS error_rate
FROM request_log
WHERE logged_at >= now() - interval '1 hour'
GROUP BY service_name$$,
schedule => '5s',
refresh_mode => 'DIFFERENTIAL',
temporal_mode => 'sliding_window'
);
What Temporal Mode Costs
Temporal stream tables are slightly more expensive than non-temporal ones:
| Aspect | Non-temporal | Temporal |
|---|---|---|
| Refresh with changes | Delta only | Delta + eviction |
| Refresh with no changes | Skipped | Eviction only |
| Storage | Stream table only | Stream table + boundary index |
| CPU per cycle | Proportional to delta | Proportional to delta + evicted rows |
The key difference: non-temporal stream tables skip the refresh cycle entirely when the change buffer is empty. Temporal stream tables always run the eviction step, even with an empty change buffer — because time passing is itself a change.
For most workloads, the eviction cost is small. The number of rows crossing the window boundary per cycle is bounded by: (source table insertion rate) × (refresh interval). For a table receiving 1,000 rows/second with a 5-second refresh cycle, the eviction step processes at most 5,000 rows. Typically less, because rows don't all arrive at a uniform rate.
Non-Temporal Alternatives
If you don't need a continuously-sliding window, you can use a fixed window with a simpler approach:
-- Fixed daily window: "today's orders"
SELECT pgtrickle.create_stream_table(
'revenue_today',
$$SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= date_trunc('day', now())
GROUP BY region$$,
schedule => '5s',
refresh_mode => 'DIFFERENTIAL'
);
This doesn't need temporal_mode because date_trunc('day', now()) changes only once per day (at midnight). For the other 86,399 seconds, the window boundary is static and normal DIFFERENTIAL mode works. At midnight, a full refresh resets the window.
The distinction: sliding windows (now() - interval '7 days') shift every second. Fixed windows (date_trunc('day', now())) shift at discrete boundaries. Sliding windows need temporal mode. Fixed windows usually don't.
Temporal Mode vs. Scheduled Full Refresh
You could approximate temporal behavior by running a full refresh every N seconds:
SELECT pgtrickle.create_stream_table(
'revenue_last_24h',
$$SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= now() - interval '24 hours'
GROUP BY region$$,
schedule => '30s',
refresh_mode => 'FULL'
);
This works, but:
- Every refresh scans the entire 24-hour window. For a table with millions of orders, this is expensive.
- The refresh takes time proportional to the window size, not the change rate.
- You can't set the schedule below a few seconds without saturating the database.
Temporal mode with DIFFERENTIAL is the efficient version: it processes only the new rows (delta) and the expired rows (eviction), not the entire window.
← Back to Blog Index | Documentation
Hot, Warm, Cold, Frozen: Tiered Scheduling at Scale
How pg_trickle's scheduler stays efficient at 50, 500, and 5,000 stream tables
At 5 stream tables, the scheduler is invisible. It wakes up every second, checks if anything is due, refreshes what needs refreshing, and goes back to sleep. CPU cost: immeasurable.
At 50 stream tables, the scheduler loop takes a few milliseconds per cycle. Still invisible.
At 500 stream tables, the scheduler is checking 500 tables every cycle. Most of them haven't changed. Most of them aren't due for refresh. But the scheduler doesn't know that until it checks. The loop time starts to matter.
pg_trickle's tiered scheduling solves this. Stream tables are classified by change frequency — hot, warm, cold, frozen — and checked at different cadences. A frozen table that hasn't changed in a week isn't checked every second.
The Tiers
| Tier | Change Frequency | Check Cadence | Example |
|---|---|---|---|
| Hot | Changes every cycle or nearly | Every scheduler cycle (1s) | Real-time dashboards, event counters |
| Warm | Changes every few cycles | Every 5 cycles | Hourly aggregates, session summaries |
| Cold | Changes infrequently | Every 30 cycles | Weekly reports, monthly rollups |
| Frozen | No changes in extended period | Every 60 cycles | Archived data, one-time imports |
"Check cadence" means how often the scheduler looks at the stream table's change buffer to determine if a refresh is needed. A cold table with a 10-second schedule is still refreshed every 10 seconds if it has changes — but the scheduler only checks for changes every 30 cycles instead of every cycle.
Automatic Classification
Tier assignment is automatic. pg_trickle tracks a rolling window of "did this stream table have changes in the last N scheduler cycles?" and classifies based on the ratio:
change_ratio = cycles_with_changes / total_cycles (over last 100 cycles)
| change_ratio | Tier |
|---|---|
| > 0.8 | Hot |
| 0.2 – 0.8 | Warm |
| 0.01 – 0.2 | Cold |
| < 0.01 | Frozen |
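Expressed as SQL, the classification in the table above amounts to this (an illustrative sketch; the real classification runs inside the background worker):
SELECT name,
       change_ratio,
       CASE
         WHEN change_ratio > 0.8   THEN 'hot'
         WHEN change_ratio >= 0.2  THEN 'warm'
         WHEN change_ratio >= 0.01 THEN 'cold'
         ELSE 'frozen'
       END AS tier
FROM pgtrickle.pgt_status();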
Promotion is immediate: if a frozen table starts receiving changes, it's promoted to hot on the next scheduler cycle. Demotion is gradual: a table must be consistently quiet to move from hot to warm, warm to cold, etc. This prevents flapping.
Why It Matters
The scheduler loop has a per-table cost. For each table, it:
1. Reads the catalog entry (schedule, status, last refresh time).
2. Checks the change buffer for pending changes.
3. Evaluates the dependency graph (is the table due? are upstream tables current?).
4. Decides whether to dispatch a refresh.
Steps 2 and 3 are the expensive parts. Step 2 requires a query against the change buffer table. Step 3 requires walking the DAG.
Without tiering, 500 tables means 500 change-buffer checks per second. With tiering, if 50 are hot, 100 are warm, 150 are cold, and 200 are frozen:
Checks per cycle:
Hot: 50 × 1.0 = 50
Warm: 100 × 0.2 = 20
Cold: 150 × 0.033 = 5
Frozen: 200 × 0.017 = 3.4
Total: ~78 per cycle (vs. 500 without tiering)
That's an 84% reduction in scheduler overhead. The savings compound because the skipped checks are precisely the tables that don't have changes — checking them would have been wasted work anyway.
Enabling Tiered Scheduling
Tiered scheduling is enabled by default since v0.12.0:
SHOW pg_trickle.tiered_scheduling;
-- on
To disable (not recommended, but available for debugging):
SET pg_trickle.tiered_scheduling = off;
With tiering off, every table is checked every cycle. This is fine for small deployments (<50 tables) but wasteful at scale.
Monitoring Tiers
SELECT
name,
schedule,
tier,
change_ratio,
last_refresh
FROM pgtrickle.pgt_status()
ORDER BY tier, change_ratio DESC;
name | schedule | tier | change_ratio | last_refresh
---------------------+----------+--------+--------------+---------------------
live_dashboard | 2s | hot | 0.95 | 2026-04-27 10:15:01
event_counter | 1s | hot | 0.88 | 2026-04-27 10:15:01
session_summary | 10s | warm | 0.45 | 2026-04-27 10:14:55
daily_revenue | 30s | warm | 0.22 | 2026-04-27 10:14:40
monthly_report | 5m | cold | 0.03 | 2026-04-27 10:10:00
archived_metrics | 1h | frozen | 0.00 | 2026-04-27 09:00:00
If a table you expect to be hot shows up as cold, check the change buffer — maybe the source table isn't receiving DML as expected.
Interaction with Event-Driven Wake
pg_trickle also supports event-driven wake via LISTEN/NOTIFY. When CDC triggers fire, they emit a NOTIFY that wakes the scheduler immediately instead of waiting for the next polling cycle.
Tiered scheduling and event-driven wake are complementary:
- Event-driven wake reduces latency: the scheduler doesn't wait up to 1 second to notice a change.
- Tiered scheduling reduces overhead: the scheduler doesn't waste cycles checking tables that haven't changed.
Both are enabled by default. Together, they handle the "many tables, sporadic changes" workload efficiently: most tables are checked infrequently (tiering), and the ones that do change are refreshed immediately (event-driven wake).
Tuning for Large Deployments
For deployments with 1,000+ stream tables:
Increase scheduler interval slightly:
SET pg_trickle.scheduler_interval_ms = 2000; -- 2 seconds instead of 1
This halves the number of scheduler cycles per second. With tiered scheduling, the per-cycle cost is already low, so the impact on freshness is minimal (hot tables are still checked every 2 seconds).
Ensure event-driven wake is on:
SET pg_trickle.event_driven_wake = on;
This ensures that hot tables are refreshed immediately on change, regardless of the scheduler interval. The scheduler interval only affects the polling fallback.
Monitor scheduler loop time:
SELECT
avg_loop_ms,
max_loop_ms,
tables_checked_per_cycle,
tables_refreshed_per_cycle
FROM pgtrickle.health_summary();
If avg_loop_ms exceeds 100ms, the scheduler is doing too much work per cycle. This typically means too many tables are classified as hot (because they all have continuous changes). Consider:
- Increasing the schedule for tables that don't need sub-second freshness.
- Using CALCULATED scheduling to let intermediate tables inherit longer cadences.
- Checking whether bulk imports are keeping change buffers permanently active.
The Frozen-to-Hot Promotion Path
When a frozen table starts receiving changes (e.g., a monthly batch import), the promotion is immediate:
- CDC trigger fires, emitting NOTIFY.
- Scheduler wakes, checks the table (even though it's in the frozen tier, NOTIFY overrides the tier check cadence).
- Table is promoted to hot.
- Normal scheduling resumes.
The promotion happens within one scheduler cycle — there's no delay from being in the frozen tier. The NOTIFY acts as an interrupt, bypassing the tier-based check cadence.
Demotion back to frozen takes longer: the table must have zero changes for ~100 consecutive cycles. This prevents flapping during sporadic-but-recurring workloads.
Summary
Tiered scheduling classifies stream tables as hot, warm, cold, or frozen based on change frequency. The scheduler checks hot tables every cycle and frozen tables every ~60 cycles, reducing per-cycle overhead by 80%+ at scale.
Classification is automatic. Promotion is immediate (via NOTIFY interrupt). Demotion is gradual (prevents flapping). It's enabled by default and requires no configuration for most deployments.
At 5 tables, you don't need it. At 500, you can't live without it.
← Back to Blog Index | Documentation
Time-Series Downsampling Without TimescaleDB
Keep hourly, daily, and monthly rollups in sync with raw data — using stream tables instead of a dedicated TSDB
Every IoT platform, observability stack, and financial system hits the same wall. Raw sensor data accumulates at thousands of rows per second. Dashboards need to display trends over hours, days, and months. Reading raw data for a 30-day chart means scanning hundreds of millions of rows. The standard answer is to pre-aggregate: maintain rollup tables at different time granularities. The non-standard part is keeping those rollups correct as late data arrives, corrections are applied, and backfills happen.
TimescaleDB solves this with continuous aggregates — materialized views that automatically refresh over time buckets. It's a good product. But it requires a dedicated extension, a specific table format (hypertables), and its own mental model. If your data is already in regular PostgreSQL tables and you want rollups that maintain themselves incrementally, pg_trickle gives you the same capability with standard SQL.
The Rollup Problem
Consider a temperature monitoring system. Sensors report every 5 seconds:
CREATE TABLE sensor_readings (
sensor_id integer,
recorded_at timestamptz,
temperature numeric(5,2),
humidity numeric(5,2)
);
You need three rollup levels:
- Hourly — average temperature and humidity per sensor per hour
- Daily — min, max, average per sensor per day
- Monthly — average and percentile distributions per sensor per month
The naive approach is a cron job that truncates and rebuilds each rollup table every hour. This works until it doesn't — until the rebuild takes longer than an hour, until a backfill invalidates three months of daily aggregates, until a dashboard user notices a 45-minute gap between reality and the chart.
The incremental approach maintains each rollup as a stream table. When a sensor reading is inserted, the hourly bucket it falls into is updated in the same transaction (or within milliseconds via the background scheduler). Late data that arrives for yesterday? The daily rollup for yesterday is corrected. Backfill three months of historical data? The monthly rollups rebuild differentially, processing only the buckets that received new data.
Hourly Rollups
SELECT pgtrickle.create_stream_table(
'sensor_hourly',
$$
SELECT
sensor_id,
date_trunc('hour', recorded_at) AS hour,
AVG(temperature) AS avg_temp,
AVG(humidity) AS avg_humidity,
COUNT(*) AS reading_count,
MIN(temperature) AS min_temp,
MAX(temperature) AS max_temp
FROM sensor_readings
GROUP BY sensor_id, date_trunc('hour', recorded_at)
$$
);
This stream table tracks every hourly bucket that has been touched. When 100 new readings arrive for sensor 42 in the 14:00 hour, the incremental refresh only recomputes the aggregate for that single bucket — not all 8,760 hourly buckets across the year. The cost is proportional to the number of distinct buckets affected, not the total data volume.
For a system ingesting 1,000 readings per second across 500 sensors, a 5-second refresh window touches approximately 500 buckets (one per sensor for the current hour). The refresh processes only those 5,000 new readings and updates 500 aggregate rows. Compare that to scanning 31 billion readings (one year of data) to rebuild from scratch.
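The read side is just plain SQL against the rollup; a 24-hour chart for one sensor touches at most 24 rows (illustrative query):
SELECT hour, avg_temp, min_temp, max_temp, reading_count
FROM sensor_hourly
WHERE sensor_id = 42
  AND hour >= date_trunc('hour', now()) - interval '24 hours'
ORDER BY hour;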
Daily Rollups From Hourly Data
Here's where the cascade becomes powerful. The daily rollup can be defined on top of the hourly rollup:
SELECT pgtrickle.create_stream_table(
'sensor_daily',
$$
SELECT
sensor_id,
date_trunc('day', hour) AS day,
AVG(avg_temp) AS avg_temp,
MIN(min_temp) AS daily_min_temp,
MAX(max_temp) AS daily_max_temp,
SUM(reading_count) AS total_readings
FROM sensor_hourly
GROUP BY sensor_id, date_trunc('day', hour)
$$
);
Because sensor_hourly is itself a stream table, pg_trickle understands the dependency chain. When raw readings are inserted, the hourly rollup is updated first, and then the daily rollup is updated from the hourly changes. This cascade happens automatically — the DAG scheduler ensures correct ordering.
The daily rollup never touches raw data. It only processes changes to the 24 hourly buckets that make up a day. Even for a backfill of a million historical readings, the daily rollup only processes the distinct hourly buckets those readings fall into.
Handling Late Data
Late-arriving data is the nightmare scenario for traditional rollup systems. A sensor was offline for 6 hours and suddenly reports all its buffered readings with timestamps from the past. A data correction replays yesterday's data with fixed calibration values.
With pg_trickle, late data is not special. It's just an insert (or update) with a timestamp that happens to fall in a past bucket. The incremental engine processes it exactly like any other change: it identifies which hourly bucket is affected, computes the delta to the aggregate, and propagates that delta up the chain.
-- Late data arrives: sensor 42 reports readings from 6 hours ago
INSERT INTO sensor_readings (sensor_id, recorded_at, temperature, humidity)
VALUES
(42, now() - interval '6 hours', 22.5, 55.0),
(42, now() - interval '6 hours' + interval '5 seconds', 22.6, 54.8),
-- ... hundreds more
;
-- Next refresh: only the affected hourly and daily buckets are updated
SELECT pgtrickle.refresh_stream_table('sensor_hourly');
-- sensor_daily is automatically refreshed via dependency chain
No special "backfill mode." No invalidation of downstream caches. No need to identify which buckets were affected and selectively rebuild them. The differential engine handles it.
Monthly Summaries and Percentiles
For monthly reporting, you often need more than simple aggregates. Percentiles, distribution widths, and trend indicators require access to the underlying data distribution.
SELECT pgtrickle.create_stream_table(
'sensor_monthly',
$$
SELECT
sensor_id,
date_trunc('month', day) AS month,
AVG(avg_temp) AS monthly_avg_temp,
MIN(daily_min_temp) AS monthly_min,
MAX(daily_max_temp) AS monthly_max,
SUM(total_readings) AS monthly_readings,
MAX(daily_max_temp) - MIN(daily_min_temp) AS temp_range
FROM sensor_daily
GROUP BY sensor_id, date_trunc('month', day)
$$
);
The three-level cascade — raw → hourly → daily → monthly — means that inserting a single reading at the bottom can propagate all the way to the monthly summary in one refresh cycle. The total cost is three incremental updates (one per level), each touching a single group. Compare that to scanning the raw table three times with different date_trunc granularities.
Comparison With TimescaleDB Continuous Aggregates
| Feature | TimescaleDB | pg_trickle |
|---|---|---|
| Requires hypertables | Yes | No (any table) |
| Refresh granularity | Time bucket window | Per-changed-row differential |
| Cascading rollups | Manual (materialized on top of materialized) | Automatic DAG scheduling |
| Late data handling | Re-materializes entire bucket | Incremental delta on affected bucket |
| Works with JOINs | Limited (single hypertable) | Full SQL including multi-table joins |
| Extension dependency | timescaledb | pg_trickle |
The key difference is granularity. TimescaleDB refreshes an entire time bucket when any row in that bucket changes. If your hourly bucket has 10,000 rows and one arrives late, all 10,000 are re-read. pg_trickle computes the delta from the single new row and adjusts the aggregate arithmetically. For high-cardinality time series (many sensors, many metrics), this difference is substantial.
Real-World Sizing
For a production IoT deployment with:
- 10,000 sensors
- 1 reading per sensor per 10 seconds
- 1,000 readings/second sustained
The data volumes are:
- Raw table: ~2.6 billion rows/month
- Hourly rollup: 7.2M rows/month (10,000 sensors × 720 hours)
- Daily rollup: 300K rows/month (10,000 sensors × 30 days)
- Monthly rollup: 10,000 rows/month
With pg_trickle maintaining all three rollup levels, the refresh cost per second is approximately:
- Process 1,000 new raw readings
- Update ~1,000 hourly buckets (one per sensor for the current hour, but only those that received new data)
- Update ~100 daily buckets (sensors whose hourly aggregate actually changed meaningfully)
- Monthly rollup: updated once daily via scheduled refresh
Total CPU cost: a few milliseconds per second of ingested data. The dashboards are never more than a few seconds stale, and the database never performs a full scan of billions of rows.
Getting Started
-- Just create your regular table (no hypertable conversion needed)
CREATE TABLE metrics (
device_id integer,
ts timestamptz DEFAULT now(),
value double precision
);
-- Define your rollups as stream tables
SELECT pgtrickle.create_stream_table(
'metrics_hourly',
$$
SELECT device_id,
date_trunc('hour', ts) AS hour,
AVG(value) AS avg_val,
COUNT(*) AS samples
FROM metrics
GROUP BY device_id, date_trunc('hour', ts)
$$
);
-- Insert data normally
INSERT INTO metrics (device_id, value)
SELECT (random() * 100)::int, random() * 50
FROM generate_series(1, 10000);
-- Refresh — only new data is processed
SELECT pgtrickle.refresh_stream_table('metrics_hourly');
Your time-series rollups are live. No hypertable conversion, no extension-specific table types, no refresh policies to configure. Just SQL.
Stop rebuilding rollups from scratch. Let the differential engine propagate only what changed.
← Back to Blog Index | Documentation
TPC-H at 1GB in 40ms
Benchmarking Incremental View Maintenance Against Full Refresh
Benchmarks are often used to mislead: single-number results without methodology, workloads that don't match production, optimistic configurations chosen to flatter the system under test.
This post does something different. It runs the TPC-H benchmark — a standard decision-support workload — in two modes: full refresh and differential refresh. It shows the actual numbers, explains the methodology, and tells you when the differential results don't apply.
The point is not to show pg_trickle winning everywhere. It's to show where the differential approach has large wins, where the wins are modest, and where full refresh is the right answer.
The Benchmark Setup
Hardware: 8-core AMD EPYC 9254 (2.9GHz base), 32GB RAM, NVMe SSD (Seagate FireCuda, ~7GB/s sequential read), PostgreSQL 18.1.
Dataset: TPC-H scale factor 1 (approximately 1GB of raw data across 8 tables: lineitem, orders, customer, part, partsupp, supplier, nation, region).
PostgreSQL config:
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 256MB
maintenance_work_mem = 2GB
max_parallel_workers_per_gather = 4
pg_trickle config:
pg_trickle.max_parallel_workers = 4
pg_trickle.backpressure_enabled = off # disabled for benchmark clarity
Methodology: Each stream table is created. An initial full refresh establishes the baseline. Then a "delta batch" of 1,000 modified rows is applied to the relevant source tables (simulating one refresh cycle's worth of changes). We measure the time from change application to a consistent stream table state.
For full refresh mode, this means running REFRESH MATERIALIZED VIEW CONCURRENTLY and measuring wall time. For differential mode, it means measuring one pg_trickle refresh cycle.
The Queries
We implement five TPC-H queries as stream tables. These represent a range of complexity from simple aggregates to multi-table joins:
Q1: Pricing Summary
-- Aggregate lineitems by return flag and line status
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount))
AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax))
AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
Q3: Shipping Priority
-- Revenue for top unshipped orders by market segment and order date
SELECT
l.l_orderkey,
SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue,
o.o_orderdate,
o.o_shippriority
FROM customer c
JOIN orders o ON c.c_custkey = o.o_custkey
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
WHERE c.c_mktsegment = 'BUILDING'
AND o.o_orderdate < DATE '1995-03-15'
AND l.l_shipdate > DATE '1995-03-15'
GROUP BY l.l_orderkey, o.o_orderdate, o.o_shippriority;
Q6: Forecasting Revenue Change
-- Revenue change forecast based on discounts and quantities
SELECT
SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= DATE '1994-01-01'
AND l_shipdate < DATE '1995-01-01'
AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
AND l_quantity < 24;
Q5: Local Supplier Volume
-- Revenue through local suppliers per nation in Asia
SELECT
n.n_name,
SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM customer c
JOIN orders o ON c.c_custkey = o.o_custkey
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
JOIN supplier s ON l.l_suppkey = s.s_suppkey
JOIN nation n ON s.s_nationkey = n.n_nationkey
JOIN region r ON n.n_regionkey = r.r_regionkey
WHERE r.r_name = 'ASIA'
AND o.o_orderdate >= DATE '1994-01-01'
AND o.o_orderdate < DATE '1995-01-01'
GROUP BY n.n_name;
Q12: Shipping Modes and Order Priority
-- Distribution of high-priority orders by shipping mode
SELECT
l.l_shipmode,
SUM(CASE WHEN o.o_orderpriority = '1-URGENT'
OR o.o_orderpriority = '2-HIGH'
THEN 1 ELSE 0 END) AS high_line_count,
SUM(CASE WHEN o.o_orderpriority <> '1-URGENT'
AND o.o_orderpriority <> '2-HIGH'
THEN 1 ELSE 0 END) AS low_line_count
FROM orders o
JOIN lineitem l ON o.o_orderkey = l.l_orderkey
WHERE l.l_shipmode IN ('MAIL', 'SHIP')
AND l.l_commitdate < l.l_receiptdate
AND l.l_shipdate < l.l_commitdate
AND l.l_receiptdate >= DATE '1994-01-01'
AND l.l_receiptdate < DATE '1995-01-01'
GROUP BY l.l_shipmode;
The Results
Single Refresh Cycle: 1,000 Modified Rows
| Query | Full Refresh (ms) | Differential (ms) | Speedup |
|---|---|---|---|
| Q1 (simple aggregate, 1 table) | 890 | 41 | 21.7× |
| Q6 (filter + aggregate, 1 table) | 620 | 38 | 16.3× |
| Q12 (2-table join + conditional agg) | 1,240 | 68 | 18.2× |
| Q3 (3-table join + aggregate) | 2,180 | 112 | 19.5× |
| Q5 (6-table join + aggregate) | 3,890 | 287 | 13.6× |
Throughput: 5,000 Changes per Second Sustained
Applying changes at 5,000 rows/second to lineitem and measuring how much stream tables lag behind:
| Query | Full Refresh Lag | Differential Lag |
|---|---|---|
| Q1 | 31 seconds | 0.3 seconds |
| Q3 | 87 seconds | 0.8 seconds |
| Q5 | 186 seconds | 2.4 seconds |
Full refresh can't keep up. The refresh takes longer than the interval between refreshes. Under sustained write load, the lag grows without bound until writes stop.
Differential mode keeps up for every query tested.
Why the Numbers Are What They Are
Q1 and Q6 (single-table aggregates): These have the best differential speedup because the delta is fully local — a changed lineitem row affects exactly one group in the aggregate. The DVM engine does one index lookup per changed row, applies the delta to one group, done. The full scan, by contrast, reads all 6 million lineitem rows.
Q3 (3-table join): The delta propagation requires looking up customer via orders for each changed lineitem. This is 1,000 changed rows × 1 join lookup each = 1,000 index lookups. Still far cheaper than scanning all three tables.
Q5 (6-table join): The longest delta chain — a change in lineitem propagates through supplier → nation → region to determine whether the supplier is in Asia. Six tables means five join hops in the delta computation. The 287ms differential time is still 13× faster than full refresh, but the advantage narrows as join depth increases.
The pattern is clear: as query complexity (join count, group count) increases, the differential speedup decreases. But it's always faster, because the alternative is scanning millions of rows.
The Initial Population Cost
One number that's easy to overlook: creating a stream table with refresh_mode => 'DIFFERENTIAL' requires an initial full population. At TPC-H SF1, these times are:
| Query | Initial population |
|---|---|
| Q1 | 1.2 seconds |
| Q3 | 3.8 seconds |
| Q5 | 6.1 seconds |
You pay this cost once — at stream table creation time, or after a schema change that requires a rebuild. After that, every refresh cycle is differential.
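For reference, the call that incurs this one-time cost is an ordinary stream table creation. The sketch below uses a simplified single-table aggregate in the spirit of Q1, not the exact benchmark query, and the stream table name is illustrative:
-- Sketch: the initial population runs the full query once; every later cycle is differential.
SELECT pgtrickle.create_stream_table(
    name  => 'q1_pricing_summary',    -- illustrative name
    query => $$
        SELECT
            l_returnflag,
            l_linestatus,
            SUM(l_quantity)                         AS sum_qty,
            SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
            COUNT(*)                                AS count_order
        FROM lineitem
        GROUP BY l_returnflag, l_linestatus
    $$,
    schedule     => '5 seconds',
    refresh_mode => 'DIFFERENTIAL'
);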
Cases Where Full Refresh Wins
There are queries where differential mode doesn't apply or helps less than expected:
MEDIAN and other non-differentiable aggregates: These always require full refresh. There's no algebraic delta rule. The full refresh time is the cost you pay.
Very small tables: For a summary table with 10 rows derived from 1,000 source rows, a full scan takes < 1ms. The overhead of change buffer management and delta computation adds more latency than it saves. Use full refresh for tiny lookups.
Bulk loads: When you load 1 million rows at once, the "delta" is 1 million rows. The differential path processes each changed row with the same per-row overhead as a scan. In fact, for very large deltas, the scan path is faster because it can be parallelized. pg_trickle automatically falls back to a full refresh when a bulk load is detected (configurable via bulk_load_threshold).
First-time refresh after a large schema change: If you add a new column that requires recomputing every row, that's a full scan regardless of refresh mode. Plan for this in your change management process.
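For the tiny-lookup case above, asking for full refresh explicitly is the simplest configuration. A sketch with illustrative table and column names:
-- Sketch: a handful of output rows over a small source table; a full scan per cycle
-- is cheaper than maintaining a change buffer for it.
SELECT pgtrickle.create_stream_table(
    name  => 'plan_counts',
    query => $$
        SELECT plan, COUNT(*) AS subscribers
        FROM accounts
        GROUP BY plan
    $$,
    schedule     => '1 minute',
    refresh_mode => 'FULL'
);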
Reproducing These Results
The TPC-H benchmark is included in the pg_trickle test suite. You can run it yourself:
# Build the E2E test image
just build-e2e-image
# Run TPC-H tests (marked as ignored by default due to long runtime)
cargo test --test e2e_tpch_tests -- --ignored --test-threads=1 --nocapture
# Control the number of cycles
TPCH_CYCLES=10 cargo test --test e2e_tpch_tests -- --ignored --test-threads=1 --nocapture
The test generates TPC-H scale factor 1 data, creates stream tables for all five queries, applies delta batches of 1,000 rows, and measures refresh time and lag. Results are logged to stdout.
We run this benchmark on every push to main in CI (GitHub Actions, ubuntu-latest, 4-core runner). The CI hardware is slower than the local NVMe machine above — expect 2–3× longer absolute times, but similar ratios.
What This Means for Your Workload
TPC-H is a decision-support workload — complex analytical queries over large tables. Your OLTP workload will likely see better differential speedups for a few reasons:
- OLTP tables tend to have smaller row counts per table but more tables
- OLTP changes are typically small, scattered updates (not bulk)
- OLTP aggregates are usually simpler (COUNT, SUM) over more focused ranges
For a typical e-commerce dashboard (daily revenue by region, customer order counts, inventory levels), expect:
- Full refresh: 200ms–5s depending on table size
- Differential: 5–50ms per cycle, regardless of table size
For a real-time leaderboard or live feed:
- Full refresh: 500ms–2s
- Differential: < 10ms per cycle
The speedup is most dramatic when the ratio of "changed rows per cycle" to "total rows" is small. That ratio is almost always very small in production. Hence the consistent wins in the benchmark.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. The full benchmark suite is in the repository at tests/e2e_tpch_tests.rs.
← Back to Blog Index | Documentation
The Hidden Cost of Trigger-Based Denormalization
How hand-rolled sync logic breaks — and what to do about it
You have a products table, a categories table, a suppliers table, and an inventory table. Your application needs to display a product listing page that combines fields from all four. You need it fast — 2ms, not 200ms. So you create a denormalized product_listing table with all the fields pre-joined, and you write triggers to keep it in sync.
That was eighteen months ago.
Today your denormalized table has seven triggers across four source tables, each written by a different engineer in a different sprint. Two of them have subtle race conditions. One of them has a performance regression that nobody traced back to the trigger. The data drifts by 0.3% every week in ways you can't explain. Your oncall rotation includes a step called "rerun the denorm sync script."
This is not bad engineering. This is the predictable outcome of building derived data maintenance with the wrong abstraction.
Why Triggers Seem Like the Right Answer
Triggers are PostgreSQL's built-in mechanism for reacting to changes. They fire on INSERT, UPDATE, DELETE. They run in the same transaction as the change. They're fast. They're simple for the first case.
The first trigger you write is usually clean:
CREATE OR REPLACE FUNCTION sync_product_listing()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO product_listing (
product_id, product_name, category_name, supplier_name, stock_qty
)
SELECT
p.id, p.name, c.name, s.name, i.qty
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN suppliers s ON s.id = p.supplier_id
LEFT JOIN inventory i ON i.product_id = p.id
WHERE p.id = NEW.id
ON CONFLICT (product_id) DO UPDATE SET
product_name = EXCLUDED.product_name,
category_name = EXCLUDED.category_name,
supplier_name = EXCLUDED.supplier_name,
stock_qty = EXCLUDED.stock_qty;
RETURN NEW;
END;
$$;
CREATE TRIGGER trg_products_to_listing
AFTER INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION sync_product_listing();
Works perfectly. Fast, transactional, no external jobs. Ship it.
Then requirements change.
The Cascade of Complexity
Month 1: You add a trigger for categories changes. When a category is renamed, every product in that category needs its category_name updated. You write a statement-level trigger that issues a bulk UPDATE.
Month 3: A supplier changes their name. Same pattern — bulk UPDATE for all affected products. A third engineer adds this trigger. The bulk UPDATE takes 3 seconds on a busy day because there are 40,000 products from that supplier. The original product INSERT trigger is now occasionally blocked waiting for a row lock.
Month 5: Inventory levels change thousands of times a day. Someone adds a trigger. Performance immediately degrades — product_listing is being updated on every inventory change, and inventory changes very often. The trigger is rewritten to only update if stock_qty changes by more than 10 units. This works until a product goes from 5 units to 0 units in increments of 1, and the product_listing never reaches stock_qty = 0.
Month 7: You discover that batch imports of products (10,000 rows via COPY) are slow because the row-level trigger fires 10,000 times. The import script disables the trigger (ALTER TABLE products DISABLE TRIGGER trg_products_to_listing) and a separate batch-import procedure rebuilds the affected listing rows afterwards. Now there are two code paths, and they've drifted.
Month 11: The categories table gains a parent_category_id column. The category path for display is now Electronics > Phones > Accessories. Updating the trigger to compute the category path inline is painful. A category_path helper function is added. It does a recursive CTE on every trigger invocation.
Month 15: A data audit finds 847 rows in product_listing where category_name doesn't match categories.name. Investigation reveals a bug in the categories update trigger that was introduced in Month 1 and fixed in Month 8, but the 847 rows changed during the window when the bug existed. A one-time repair script is written and added to runbooks.
The Root Cause
Triggers are imperative. You describe how to update the denormalized table, not what it should contain.
This distinction matters because "how" changes as the query changes. Every time the definition of product_listing evolves — new join, new column, changed logic — every trigger that touches product_listing must be reviewed and potentially updated. Correctness requires that all triggers stay synchronized with each other and with the current definition.
In practice, they don't. Engineers add columns to the denormalized table and forget to add them to the triggers. The query logic in the trigger diverges from the query logic in the application. The triggers were written for row-level operations but the correct logic requires seeing multiple rows at once.
There are four specific failure modes that appear in almost every trigger-based denormalization system at sufficient age:
Failure Mode 1: The Blind UPDATE
The typical pattern is:
-- In the categories trigger
UPDATE product_listing
SET category_name = NEW.name
WHERE category_id = NEW.id;
This works when product_listing.category_name should always equal categories.name. It breaks the instant category_name in product_listing is derived from anything other than categories.name directly — say, a concatenation with the parent category. Now you need the full query logic in the trigger, but the trigger was written before the concatenation existed.
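Once the column is derived, the blind statement has to grow into a re-derivation. Roughly this shape, reusing the parent_category_id column from the Month 11 scenario; a sketch, not a drop-in fix:
-- Sketch: the categories trigger after category_name becomes "Parent > Child".
-- Every later change to the derivation means revisiting this statement as well.
UPDATE product_listing pl
SET category_name = pc.name || ' > ' || c.name
FROM categories c
JOIN categories pc ON pc.id = c.parent_category_id
WHERE c.id = NEW.id
  AND pl.category_id = c.id;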
Failure Mode 2: Statement vs. Row Triggers
Row-level triggers fire once per affected row. Statement-level triggers fire once per DML statement. They have different semantics, especially for bulk operations, and most developers don't think about which they need until they hit a correctness bug.
If you update 1,000 product records in a single UPDATE statement:
- A FOR EACH ROW trigger fires 1,000 times, each with access to the individual row change. Safe but slow for bulk operations.
- A FOR EACH STATEMENT trigger fires once, with access to the set of changes via transition tables (OLD TABLE, NEW TABLE). Fast but requires set-based logic that most people get wrong.
Teams typically start with FOR EACH ROW (easier to write), discover performance problems, try to rewrite as FOR EACH STATEMENT, introduce bugs in the set logic, and revert. Or they give up and schedule a batch job.
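For reference, the set-based version of the categories sync looks roughly like this, a sketch of a statement-level trigger using transition tables. Writing and maintaining one of these per source table is exactly the burden this post is describing:
-- Sketch: statement-level sync for category renames via a transition table.
CREATE OR REPLACE FUNCTION sync_listing_on_category_update()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
  -- new_rows holds every category row touched by the triggering UPDATE statement
  UPDATE product_listing pl
  SET category_name = n.name
  FROM new_rows n
  WHERE pl.category_id = n.id;
  RETURN NULL;
END;
$$;

CREATE TRIGGER trg_categories_stmt_sync
AFTER UPDATE ON categories
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT EXECUTE FUNCTION sync_listing_on_category_update();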
Failure Mode 3: The Invisible Delete
INSERTs and UPDATEs are straightforward to handle. Deletes are harder to get right.
When a supplier is deleted, you need to either delete the product_listing rows for their products or set the supplier_name to NULL. The trigger fires after the row is deleted, so you can't query suppliers for the affected products; you have to work from the OLD row, which still carries the deleted supplier's ID.
-- DELETE trigger: OLD has the deleted row, but products still have supplier_id
-- This is fine. But what if the product itself is cascade-deleted too?
-- Now your trigger runs on BOTH tables and order matters.
Cascade deletions, deferred constraints, and foreign key actions interact with trigger ordering in ways that are documented but rarely read. The failure mode is usually a FOREIGN KEY violation or an orphaned denormalized row.
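The simple case does work if you lean on OLD, as in the sketch below, written before any cascades fire. The trouble is that this is only one of the tables involved:
-- Sketch: supplier deletion handled from OLD; cascade deletes and trigger
-- ordering are where this pattern starts to break down.
CREATE OR REPLACE FUNCTION on_supplier_delete()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
  UPDATE product_listing pl
  SET supplier_name = NULL
  FROM products p
  WHERE p.supplier_id = OLD.id
    AND pl.product_id = p.id;
  RETURN OLD;
END;
$$;

CREATE TRIGGER trg_suppliers_delete_sync
AFTER DELETE ON suppliers
FOR EACH ROW EXECUTE FUNCTION on_supplier_delete();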
Failure Mode 4: The Multi-Row Race
Consider two concurrent transactions:
- Transaction A updates product 1's category from X to Y
- Transaction B updates category Y's name
Both trigger the denorm update for product 1. Which one wins? In READ COMMITTED isolation (the PostgreSQL default), the answer is "whichever commits last," but the trigger logic in each transaction sees a different snapshot of the data. You can end up with product_listing.category_name reflecting a state that never actually existed — Y's new name applied to a product that still had category X at the time of the name change.
This is a race condition. It's rare, intermittent, and produces data that's almost right — the hardest kind of bug to find.
What IVM Does Differently
The key difference between trigger-based denormalization and incremental view maintenance is that IVM is declarative. You declare what the result should be. The engine figures out how to maintain it.
SELECT pgtrickle.create_stream_table(
name => 'product_listing',
query => $$
SELECT
p.id AS product_id,
p.name AS product_name,
CONCAT(pc.name, ' > ', c.name) AS category_path,
s.name AS supplier_name,
s.lead_days,
i.qty AS stock_qty,
i.updated_at AS inventory_updated_at
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN categories pc ON pc.id = c.parent_category_id
JOIN suppliers s ON s.id = p.supplier_id
LEFT JOIN inventory i ON i.product_id = p.id
WHERE p.published = true
$$,
schedule => '5 seconds',
refresh_mode => 'DIFFERENTIAL'
);
When a category is renamed, when a supplier is deleted, when inventory changes — pg_trickle's DVM engine computes the correct delta and applies it. The query definition is the ground truth. There are no separate trigger functions to keep in sync with it.
The query can be as complex as you need — multiple joins, computed columns, filters, aggregates. Change it by calling pgtrickle.alter_stream_table() with a new query. The engine rebuilds the stream table and starts maintaining the new definition.
The Concurrency Story
IVM handles the multi-row race correctly by design. The CDC triggers capture changes as part of the source transaction, but the delta is applied by the background worker in a subsequent transaction that sees a consistent snapshot. The worker never applies a partial or inconsistent state.
The correctness guarantee is: product_listing is always a consistent snapshot of the query result as of some point in time. It may be up to schedule seconds behind, but it's never internally inconsistent.
Trigger-based denormalization doesn't offer this guarantee. Each trigger update is a separate, independently committed transaction. Between the products trigger and the categories trigger for a related change, there's a window where product_listing reflects one change but not the other.
Performance: Triggers vs. IVM
For low-volume write workloads, row-level triggers are fast — they add microseconds to each DML. The overhead is per-row and constant.
For high-volume write workloads, triggers have two problems:
- Lock contention: Every trigger that writes to product_listing acquires a row lock on the destination. High-concurrency writes create a serialization point at the denormalized table.
- Write amplification: One change to categories might update thousands of rows in product_listing. This is hidden write amplification — the DML statement returns fast, but you've actually done an O(affected rows) write behind the scenes.
pg_trickle batches these changes. A stream table refresh cycle applies one consistent delta per affected group — one UPDATE per changed aggregate, one INSERT/DELETE per changed join result. The background worker runs at a configurable cadence, so high-frequency source writes are amortized.
For inventory-level changes (the "update thousands of times a day" problem from month 5 above), pg_trickle coalesces multiple changes to the same product within a refresh cycle. If a product's inventory changes 20 times in 5 seconds, only the net change is applied to product_listing in that cycle.
Migration Path
If you have an existing trigger-based denormalization setup, the migration is:
- Create the stream table with refresh_mode => 'FULL' and verify it produces the correct output.
- Switch to refresh_mode => 'DIFFERENTIAL' once the query is verified.
- Drop the manual triggers.
- Drop the maintenance scripts from your runbook.
The existing triggers and the stream table can coexist during migration — product_listing is just a table, and pg_trickle's refreshes are transactional. As long as you reconcile the two update paths before going live, the migration is safe.
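A minimal reconciliation check is a symmetric difference between the two tables. A sketch, assuming the stream table was created under a temporary name such as product_listing_st during the overlap period:
-- Sketch: rows present in one table but not the other. An empty result means
-- the trigger-maintained table and the stream table agree.
(SELECT * FROM product_listing
 EXCEPT
 SELECT * FROM product_listing_st)
UNION ALL
(SELECT * FROM product_listing_st
 EXCEPT
 SELECT * FROM product_listing);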
What You Keep
One thing trigger-based denormalization does that pg_trickle doesn't: immediate consistency. A trigger fires in the same transaction as the change. The denormalized table is updated before the original transaction commits.
If your application relies on reading the denormalized table immediately after writing to a source table in the same transaction, you need immediate consistency. This is an uncommon pattern but it exists.
For everything else — dashboards, search corpora, reporting tables, user-facing aggregates — the 5-second staleness window of a scheduled IVM refresh is acceptable and the operational simplicity is worth it.
The trigger function graveyard you've been maintaining? Replace it with a SQL query. One source of truth for what the table contains.
pg_trickle is an open-source PostgreSQL extension for incremental view maintenance. Source and documentation at github.com/trickle-labs/pg-trickle.
← Back to Blog Index | Documentation
Window Functions Without the Full Recompute
How pg_trickle maintains ROW_NUMBER, RANK, LAG, and LEAD incrementally
Window functions are the most useful SQL feature that everyone avoids putting in materialized views. The reason is obvious: ROW_NUMBER() depends on the ordering of the entire result set. Change one row and every subsequent row number might shift. Full recomputation seems unavoidable.
pg_trickle avoids it anyway, using a technique called partition-scoped recomputation. The idea: most window functions include a PARTITION BY clause, and a change to one partition doesn't affect other partitions. If you have 10,000 partitions and one row changes, you recompute one partition, not 10,000.
This post explains how it works, what it costs, and when it doesn't help.
The Problem
Consider a sales leaderboard:
SELECT
region,
salesperson,
total_sales,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rank,
LAG(total_sales) OVER (PARTITION BY region ORDER BY total_sales DESC) AS prev_sales
FROM sales_summary;
Without IVM, you run this query on demand or cache it in a materialized view. Either way, every evaluation scans the entire sales_summary table, sorts each partition, and computes row numbers and lag values.
If one salesperson in the "Northeast" region closes a deal, ideally you'd recompute only the Northeast partition. The other 49 regions haven't changed.
That's exactly what pg_trickle does.
Partition-Scoped Recomputation
When pg_trickle detects a window function in a stream table query, it doesn't try to compute a row-level delta for the window output. Window functions don't have the same algebraic delta properties as SUM or COUNT — there's no closed-form expression for "how does ROW_NUMBER() change when row X is inserted?"
Instead, it uses a coarser but still efficient strategy:
- Identify affected partitions. From the change buffer, extract the distinct partition key values that were touched. If a row in the "Northeast" region was inserted, "Northeast" is an affected partition.
- Delete old window results for those partitions. Remove all rows from the stream table where region = 'Northeast'.
- Recompute the window function for those partitions. Run the window query against the current source data, filtered to only the affected partitions.
- Insert the recomputed rows.
The cost is proportional to the size of the affected partitions, not the size of the entire table.
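In SQL terms, one refresh cycle for the leaderboard query boils down to something like the following. This is a conceptual sketch, not the engine's literal statements, with sales_leaderboard standing in for the stream table:
-- Sketch: refresh the single affected partition inside one transaction.
DELETE FROM sales_leaderboard
WHERE region = 'Northeast';

INSERT INTO sales_leaderboard
SELECT
    region,
    salesperson,
    total_sales,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rank,
    LAG(total_sales) OVER (PARTITION BY region ORDER BY total_sales DESC) AS prev_sales
FROM sales_summary
WHERE region = 'Northeast';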
What This Looks Like in Practice
SELECT pgtrickle.create_stream_table(
name => 'sales_leaderboard',
query => $$
SELECT
region,
salesperson,
total_sales,
ROW_NUMBER() OVER (
PARTITION BY region ORDER BY total_sales DESC
) AS rank,
LEAD(total_sales) OVER (
PARTITION BY region ORDER BY total_sales DESC
) AS next_below
FROM sales_summary
$$,
schedule => '5s'
);
If sales_summary is itself a stream table (maintained incrementally from raw orders), you get a two-level pipeline: orders → summary (algebraic delta) → leaderboard (partition-scoped recompute). The total latency for one new order to appear in the leaderboard is typically under 100ms.
Supported Window Functions
pg_trickle supports partition-scoped recomputation for all standard window functions:
| Function | Notes |
|---|---|
| ROW_NUMBER() | Most common. Partition recompute is exact. |
| RANK() | Ties handled correctly — all tied rows get the same rank. |
| DENSE_RANK() | No gaps in ranking sequence. |
| LAG(expr, offset) | Looks at the previous row in the partition. |
| LEAD(expr, offset) | Looks at the next row. |
| FIRST_VALUE(expr) | First row in the window frame. |
| LAST_VALUE(expr) | Last row in the window frame. Requires careful frame specification. |
| NTH_VALUE(expr, n) | Nth row in the frame. |
| NTILE(n) | Divides partition into n roughly equal groups. |
| CUME_DIST() | Cumulative distribution. |
| PERCENT_RANK() | Relative rank as a fraction. |
All of these work with ROWS, RANGE, and GROUPS frame specifications.
The No-Partition Case
What if there's no PARTITION BY?
SELECT
id,
value,
ROW_NUMBER() OVER (ORDER BY value DESC) AS global_rank
FROM measurements;
Without a partition clause, the entire result set is one partition. A change to any row triggers a full recomputation of the window function. In this case, DIFFERENTIAL mode degrades to something close to FULL refresh for the window step.
pg_trickle still applies differential maintenance to the steps before the window function (filters, joins, aggregates). Only the window recomputation itself is full. If the query is SELECT ... FROM big_table WHERE active = true and 1% of rows are active, the window recompute runs over 1% of the data, not 100%.
Recommendation: If you need global ranking over a large table, add an artificial partition key or accept that the window step will be a full recompute. For tables under ~100K rows, the full recompute is fast enough that it doesn't matter.
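If the ranking can tolerate a coarser scope, partitioning by a natural bucket restores locality. The sketch below trades a true global rank for a per-day rank, which is a semantic change you have to accept deliberately; measured_at is an illustrative column name:
-- Sketch: rank within each day instead of globally. A change now touches only
-- the day partitions whose rows actually changed.
SELECT
    id,
    value,
    ROW_NUMBER() OVER (
        PARTITION BY date_trunc('day', measured_at)
        ORDER BY value DESC
    ) AS daily_rank
FROM measurements;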
Multiple Window Clauses
Queries can have multiple OVER clauses with different partitioning:
SELECT
department,
team,
employee,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
RANK() OVER (PARTITION BY team ORDER BY salary DESC) AS team_rank
FROM employees;
pg_trickle handles this through its automatic rewrite pipeline. The query is decomposed into two window passes:
- First pass: compute dept_rank partitioned by department.
- Second pass: compute team_rank partitioned by team.
Each pass uses its own partition key for scoped recomputation. If an employee in "Engineering/Backend" gets a raise, pg_trickle recomputes:
- The "Engineering" partition for
dept_rank - The "Backend" partition for
team_rank
Other departments and teams are untouched.
Window Functions After GROUP BY
A common pattern: aggregate first, then rank.
SELECT pgtrickle.create_stream_table(
name => 'top_customers_by_region',
query => $$
SELECT
region,
customer_id,
total_revenue,
RANK() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rank
FROM (
SELECT
c.region,
o.customer_id,
SUM(o.total) AS total_revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region, o.customer_id
) sub
$$,
schedule => '10s'
);
pg_trickle processes this as a two-stage pipeline:
- Inner query (join + GROUP BY): maintained incrementally using algebraic delta rules. Only the groups affected by changed orders are updated.
- Outer query (window function): partition-scoped recompute on the groups that changed.
The amplification factor is low. If 5 orders come in across 3 customers in 2 regions, the inner stage updates 3 groups. The outer stage recomputes 2 partitions.
Frame Specifications
Window frames control which rows the function can see:
-- Running total (explicit ROWS frame)
SUM(amount) OVER (PARTITION BY account ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
-- 7-row moving average
AVG(value) OVER (PARTITION BY sensor ORDER BY ts
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
-- Range-based: all rows with same date
COUNT(*) OVER (PARTITION BY category ORDER BY date
RANGE BETWEEN '0 days' PRECEDING AND '0 days' FOLLOWING)
All frame types work with partition-scoped recomputation. The frame specification doesn't change the partitioning strategy — it only affects what the window function computes within the partition.
Performance: When It Helps, When It Doesn't
The speedup from partition-scoped recomputation depends on two factors:
- Number of partitions. More partitions = smaller per-partition recompute = bigger speedup.
- Number of affected partitions per refresh. If every refresh touches every partition, you're doing a full recompute with extra overhead.
| Scenario | Partitions | Affected/cycle | Speedup vs. FULL |
|---|---|---|---|
| Regional leaderboard | 50 | 2–3 | ~20× |
| Per-customer ranking | 10,000 | 10–50 | ~200× |
| Per-sensor percentile | 1,000 | 5–20 | ~50× |
| Global ranking (no PARTITION BY) | 1 | 1 | ~1× (no benefit) |
| High-churn table (all partitions touched) | 100 | 100 | ~1× (no benefit) |
The break-even point: if more than ~30% of partitions are affected in a single refresh cycle, pg_trickle's AUTO mode will likely choose FULL refresh instead. The partition-scoped approach has overhead (identifying affected partitions, deleting and re-inserting) that isn't worth it when most partitions are changing anyway.
Summary
Window functions in stream tables work via partition-scoped recomputation: identify which partitions were affected by source changes, recompute only those partitions, leave the rest untouched.
It's not a row-level delta — it's a partition-level delta. But for the common case of many partitions with localized changes, the performance difference is dramatic. A leaderboard over 50 regions with 100,000 salespeople refreshes in roughly the same time as a leaderboard over one region with 2,000, provided each cycle's changes land in only a few partitions.
The rule of thumb: if your window function has PARTITION BY and changes are spread across a small fraction of partitions, use DIFFERENTIAL mode. If changes hit most partitions or there's no PARTITION BY, AUTO mode will choose FULL refresh, which is the right call.
← Back to Blog Index | Documentation
The Z-Set: The Data Structure That Makes IVM Correct
A short tour of the data structure under pg_trickle's differential engine
Every incremental view maintenance system needs a way to represent changes. "Row 42 was inserted." "Row 17 was deleted." "Row 99 was updated from value A to value B."
Most systems represent this as a list of operations: INSERT, DELETE, UPDATE. The operations are applied sequentially. Ordering matters. If you apply a DELETE before the corresponding INSERT, you get an error.
pg_trickle uses a different representation: the Z-set (integer-weighted multiset). It's simpler, more compositional, and eliminates an entire class of bugs.
What a Z-Set Is
A Z-set is a collection of (element, weight) pairs, where the weight is an integer.
| Element | Weight |
|---|---|
| (alice, europe, 100) | +1 |
| (bob, asia, 200) | +1 |
| (charlie, europe, 150) | +1 |
This Z-set represents a table with three rows. Each row has weight +1, meaning "present."
When you insert a row, you add it with weight +1:
| Element | Weight |
|---|---|
| (dave, asia, 300) | +1 |
When you delete a row, you add it with weight -1:
| Element | Weight |
|---|---|
| (bob, asia, 200) | -1 |
When you update a row (bob's amount changes from 200 to 250):
| Element | Weight |
|---|---|
| (bob, asia, 200) | -1 |
| (bob, asia, 250) | +1 |
An update is a delete-then-insert. There's no special "update" operation. This simplification is the key insight.
Why Weights Instead of Operations
The operation-based approach (INSERT/DELETE/UPDATE as separate types) has a problem: the order of operations matters. If you have:
INSERT (alice, 100)
DELETE (alice, 100)
INSERT (alice, 200)
Reordering these gives different results. The system has to track and preserve ordering.
With Z-sets, the order doesn't matter. You just sum the weights:
(alice, 100): +1 - 1 = 0 ← not present
(alice, 200): +1 ← present
Net result: alice has value 200. The intermediate states cancel out. You can process the changes in any order and get the same result.
This makes Z-sets commutative and associative — you can combine them freely without worrying about ordering. This property is what allows pg_trickle to batch changes and process them efficiently.
Delta Rules as Weight Arithmetic
The differential engine's job is to transform a Z-set of input changes into a Z-set of output changes. Each SQL operator has a delta rule expressed as weight arithmetic.
Filter (WHERE)
-- Query: SELECT * FROM t WHERE amount > 100
-- Delta rule: keep rows where amount > 100, preserve weights
Input delta:
| Element | Weight |
|---|---|
| (alice, 150) | +1 |
| (bob, 50) | +1 |
Output delta:
| Element | Weight |
|---|---|
| (alice, 150) | +1 |
Bob's row has weight +1 in the input but doesn't pass the filter, so it's not in the output. Alice's row passes, so its weight is preserved.
Projection (SELECT columns)
If projecting causes duplicates, weights add up:
-- Query: SELECT region FROM customers
-- Input has two customers in 'europe'
| Element | Weight |
|---|---|
| (alice, europe) → europe | +1 |
| (bob, europe) → europe | +1 |
Result: (europe, +2). The region "europe" appears twice.
JOIN
The JOIN delta rule is where Z-sets really shine. Given tables R and S:
Δ(R ⋈ S) = (ΔR ⋈ S) ∪ (R ⋈ ΔS) ∪ (ΔR ⋈ ΔS)
In words: the change to a JOIN result is the union of:
- New R rows joined with existing S rows
- Existing R rows joined with new S rows
- New R rows joined with new S rows (handles the case where both sides change simultaneously)
Because we're working with Z-sets, the union is just weight addition. If a pair appears in both term 1 and term 3, their weights add. This handles double-counting automatically.
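Written as plain SQL over hypothetical delta relations (delta_r, delta_s) and the stored pre-change contents (r_old, s_old), the rule is three joins whose weights multiply. A sketch in which stored rows implicitly carry weight +1:
-- Sketch: the three terms of Δ(R ⋈ S). Only the last term needs an explicit
-- weight product, because stored rows have weight +1.
SELECT d.key, d.r_payload, s.s_payload, d.weight AS weight
FROM delta_r d JOIN s_old s USING (key)
UNION ALL
SELECT r.key, r.r_payload, d.s_payload, d.weight
FROM r_old r JOIN delta_s d USING (key)
UNION ALL
SELECT dr.key, dr.r_payload, ds.s_payload, dr.weight * ds.weight
FROM delta_r dr JOIN delta_s ds USING (key);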
Aggregation (GROUP BY + SUM)
-- Query: SELECT region, SUM(amount) FROM orders GROUP BY region
Input delta (an order for $100 in europe is inserted):
| Element | Weight |
|---|---|
| (europe, 100) | +1 |
The delta rule for SUM:
- Find the group (europe).
- Add weight × value to the running sum: +1 × 100 = +100.
- Output: the europe group's sum increases by 100.
If a row is deleted (weight -1), the same rule applies: -1 × 100 = -100. The sum decreases.
For COUNT: it's just the sum of weights.
For AVG: it's maintained as (SUM, COUNT) internally. AVG = SUM / COUNT.
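Applying the rule to the stored result is a one-row arithmetic update per affected group. Conceptually, with illustrative table and column names, and not the engine's literal MERGE:
-- Sketch: fold the weighted delta into the stored aggregate for one group.
-- weight × value: an insert contributes +1 × 100, a delete contributes -1 × 100.
UPDATE orders_by_region
SET total_amount = total_amount + (1 * 100)
WHERE region = 'europe';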
Composition
The power of Z-sets is that delta rules compose. A query like:
SELECT region, SUM(amount) FROM orders
JOIN customers ON customers.id = orders.customer_id
WHERE orders.status = 'shipped'
GROUP BY region;
is decomposed into: Filter → Join → GroupBy+Sum. Each operator's delta rule takes a Z-set as input and produces a Z-set as output. The Z-set from Filter feeds into the Z-set for Join, which feeds into GroupBy+Sum.
Because each rule preserves the Z-set structure (input: weighted multiset → output: weighted multiset), you can chain them without special glue code. The intermediate Z-sets handle cancellations, duplicates, and concurrent changes automatically.
Consolidation
After processing all the delta rules, the output Z-set might have redundant entries:
| Element | Weight |
|---|---|
| (europe, 1000) | -1 |
| (europe, 1100) | +1 |
| (europe, 1100) | +1 |
Consolidation sums weights for identical elements:
| Element | Weight |
|---|---|
| (europe, 1000) | -1 |
| (europe, 1100) | +2 |
And removes elements with weight 0:
If (europe, 1000) had weight +1 and -1, the net is 0 — it's removed from the output entirely.
The final consolidated Z-set drives the MERGE step: elements with negative weight are DELETEd from the stream table, elements with positive weight are INSERTed.
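Consolidation itself is just a grouped sum over the delta. A sketch over a hypothetical delta(element, weight) relation:
-- Sketch: sum weights per element and drop anything that cancels to zero.
-- Negative survivors become DELETEs, positive survivors become INSERTs.
SELECT element, SUM(weight) AS weight
FROM delta
GROUP BY element
HAVING SUM(weight) <> 0;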
What Z-Sets Can't Handle
Z-sets work for operators with well-defined inverses:
- SUM: adding is the inverse of subtracting
- COUNT: incrementing is the inverse of decrementing
- AVG: maintained as (SUM, COUNT), both invertible
- MIN/MAX: invertible with caveats (need to re-scan the group when the min/max value is deleted)
Z-sets don't work for:
- MEDIAN: No closed-form inverse. Deleting a row can change the median to any other value in the group.
- PERCENTILE_CONT/DISC: Same problem — order statistics don't have algebraic inverses.
- DISTINCT without GROUP BY: The weight represents multiplicity, but DISTINCT collapses it to 1. Deleting one of three duplicates changes the weight from 3 to 2, but DISTINCT still outputs 1. The delta is zero. Deleting the last duplicate changes weight from 1 to 0 — now the delta is -1.
For these cases, pg_trickle falls back to FULL refresh mode. The Z-set representation is still used internally, but the delta rule re-scans the group rather than computing a closed-form delta.
Why This Matters in Practice
You don't need to think about Z-sets when using pg_trickle. The abstraction is internal.
But understanding Z-sets explains:
- Why some aggregates support DIFFERENTIAL and others don't. It's about whether the aggregate has an algebraic inverse in Z-set arithmetic.
- Why UPDATEs are handled correctly without special casing. They're just delete+insert = weight -1 and weight +1.
- Why concurrent changes don't cause double-counting. Weights are additive and commutative. The order of processing doesn't matter.
- Why pg_trickle's correctness testing works. The multiset invariant (DIFFERENTIAL result = FULL result) is a Z-set equality check.
The Z-set is the foundation that makes everything else in the differential engine compositional and correct.