Monitoring Cookbook
Practical recipes for monitoring pg_tide in production. Each recipe addresses a specific operational concern with ready-to-use PromQL queries, alert rules, and dashboard configurations.
Recipe: Basic Health Monitoring
Goal: Know immediately when something is wrong.
Alerts
groups:
- name: pg-tide-health
  rules:
  - alert: PgTidePipelineUnhealthy
    expr: pg_tide_pipeline_healthy == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Pipeline {{ $labels.pipeline }} is unhealthy (circuit breaker open)"
  - alert: PgTideNoActivity
    expr: rate(pg_tide_messages_published_total[10m]) == 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline {{ $labels.pipeline }} has published zero messages for 15 minutes"
Dashboard Panel
# Traffic light: 1 = green, 0 = red
pg_tide_pipeline_healthy
Recipe: Throughput Monitoring
Goal: Understand message flow rates and detect anomalies.
Key Queries
# Messages published per second (per pipeline)
rate(pg_tide_messages_published_total[5m])
# Total throughput across all pipelines
sum(rate(pg_tide_messages_published_total[5m]))
# Publish success ratio
1 - (rate(pg_tide_publish_errors_total[5m]) / rate(pg_tide_messages_consumed_total[5m]))
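If the same expressions feed several panels and alerts, precomputing them as Prometheus recording rules keeps dashboards cheap and names consistent. A minimal sketch; the rule names are illustrative and not something pg_tide ships:
groups:
- name: pg-tide-throughput-recording
  rules:
  # per-pipeline publish rate, reusable in panels and alerts
  - record: pg_tide:messages_published:rate5m
    expr: rate(pg_tide_messages_published_total[5m])
  # publish success ratio from the query above
  - record: pg_tide:publish_success_ratio:rate5m
    expr: |
      1 - (
        rate(pg_tide_publish_errors_total[5m])
        / rate(pg_tide_messages_consumed_total[5m])
      )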
Alert: Throughput Drop
- alert: PgTideThroughputDrop
  expr: |
    rate(pg_tide_messages_published_total[5m])
    < 0.5 * rate(pg_tide_messages_published_total[1h] offset 1d)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} throughput dropped >50% vs yesterday"
Recipe: Latency Monitoring
Goal: Ensure messages are delivered within acceptable time bounds.
Key Queries
# P50 delivery latency
histogram_quantile(0.5, rate(pg_tide_delivery_latency_seconds_bucket[5m]))
# P99 delivery latency
histogram_quantile(0.99, rate(pg_tide_delivery_latency_seconds_bucket[5m]))
# Fraction of messages delivered within 1 second (multiply by 100 for a percentage)
sum(rate(pg_tide_delivery_latency_seconds_bucket{le="1.0"}[5m]))
/ sum(rate(pg_tide_delivery_latency_seconds_count[5m]))
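The within-one-second ratio is a natural SLI; recording it once makes SLO panels and burn-rate alerts cheaper to evaluate. A sketch that could sit in the same recording-rule group as above, assuming the histogram carries a pipeline label and the le="1.0" bucket used in the query (rule name is illustrative):
# fraction of messages delivered within 1s, per pipeline
- record: pg_tide:delivery_within_1s:ratio_rate5m
  expr: |
    sum by (pipeline) (rate(pg_tide_delivery_latency_seconds_bucket{le="1.0"}[5m]))
    / sum by (pipeline) (rate(pg_tide_delivery_latency_seconds_count[5m]))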
Alert: High Latency
- alert: PgTideHighLatency
  expr: histogram_quantile(0.99, rate(pg_tide_delivery_latency_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} P99 latency exceeds 5 seconds"
Recipe: Consumer Lag Monitoring
Goal: Detect growing backlogs before they become critical.
Key Queries
# Current lag (pending messages)
pg_tide_consumer_lag
# Lag growth rate (positive = growing, negative = draining)
deriv(pg_tide_consumer_lag[5m])
# Estimated time to drain (seconds) at the current publish rate
pg_tide_consumer_lag / rate(pg_tide_messages_published_total[5m])
Alert: Growing Lag
- alert: PgTideGrowingLag
  expr: pg_tide_consumer_lag > 10000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has {{ $value }} pending messages"
- alert: PgTideCriticalLag
  expr: pg_tide_consumer_lag > 100000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has critical backlog: {{ $value }} messages"
Recipe: Error Rate Monitoring
Goal: Detect delivery problems early.
Key Queries
# Errors per second
rate(pg_tide_publish_errors_total[5m])
# Error ratio (errors / total consumed)
rate(pg_tide_publish_errors_total[5m]) / rate(pg_tide_messages_consumed_total[5m])
Alert: Error Spike
- alert: PgTideErrorSpike
  expr: rate(pg_tide_publish_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has sustained errors: {{ $value }}/s"
Recipe: Dead Letter Queue Monitoring
Goal: Track messages that failed permanently and need attention.
SQL Query (for a SQL-based exporter)
-- Unresolved DLQ entries by pipeline
SELECT pipeline_name, count(*) as unresolved
FROM tide.relay_dlq
WHERE resolved_at IS NULL
GROUP BY pipeline_name;
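One way to expose this query as the pg_tide_dlq_unresolved metric used by the alert below is a custom query file for the community postgres_exporter (any SQL-to-Prometheus exporter with an equivalent row-to-gauge mapping works the same way). The namespace key and column aliases here are chosen so the resulting series is pg_tide_dlq_unresolved{pipeline="..."}:
pg_tide_dlq:
  query: |
    SELECT pipeline_name AS pipeline, count(*) AS unresolved
    FROM tide.relay_dlq
    WHERE resolved_at IS NULL
    GROUP BY pipeline_name
  metrics:
  - pipeline:
      usage: "LABEL"
      description: "Pipeline name"
  - unresolved:
      usage: "GAUGE"
      description: "Unresolved DLQ entries"
Note that pipelines with no unresolved entries return no rows, so their series disappears rather than dropping to zero.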
Alert (via SQL-based exporter)
- alert: PgTideDLQGrowing
  expr: pg_tide_dlq_unresolved > 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has {{ $value }} unresolved DLQ entries"
Recipe: Resource Monitoring
Goal: Ensure relay processes have adequate resources.
Key Queries (standard node/container metrics)
# CPU usage per relay pod
rate(container_cpu_usage_seconds_total{container="pg-tide"}[5m])
# Memory usage per relay pod
container_memory_working_set_bytes{container="pg-tide"}
# PostgreSQL active connections from relay
pg_stat_activity_count{application_name="pg-tide"}
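If the relay runs under Kubernetes resource limits, a saturation alert on the same cAdvisor metrics catches pods approaching their memory limit before the OOM killer does. A sketch; the alert name and the 90% threshold are assumptions to adapt:
- alert: PgTideRelayMemoryPressure
  # 90% of the configured memory limit is a placeholder threshold
  expr: |
    container_memory_working_set_bytes{container="pg-tide"}
    / container_spec_memory_limit_bytes{container="pg-tide"} > 0.9
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Relay pod {{ $labels.pod }} is using over 90% of its memory limit"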
Runbook Reference
| Alert | First Response | Escalation |
|---|---|---|
| PgTidePipelineUnhealthy | Check sink availability, review error logs | Restart relay if stuck |
| PgTideThroughputDrop | Check source (outbox empty?), check sink (slow?) | Scale relay instances |
| PgTideHighLatency | Check batch size, check sink response time | Increase batch size or add instances |
| PgTideGrowingLag | Check relay health, check for slow transforms | Increase batch size, add instances |
| PgTideErrorSpike | Check DLQ for error details, check sink logs | Fix root cause, replay DLQ |
| PgTideDLQGrowing | Inspect DLQ entries, identify error pattern | Fix issue, replay messages |
Further Reading
- Metrics — Full metrics reference
- Dashboards — Pre-built Grafana dashboard
- Troubleshooting — Diagnosing common issues