Monitoring Cookbook

Practical recipes for monitoring pg_tide in production. Each recipe addresses a specific operational concern with ready-to-use PromQL queries, alert rules, and dashboard configurations.
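
All recipes assume Prometheus is already scraping the relay's metrics endpoint. A minimal scrape_config sketch is shown below; the job name, target host, and port are placeholders to adjust for your deployment, not values defined by pg_tide.

scrape_configs:
  - job_name: pg-tide                       # placeholder job name
    metrics_path: /metrics
    static_configs:
      - targets: ["pg-tide-relay:9187"]     # hypothetical host:port of the relay's metrics endpoint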

Recipe: Basic Health Monitoring

Goal: Know immediately when something is wrong.

Alerts

groups:
  - name: pg-tide-health
    rules:
      - alert: PgTidePipelineUnhealthy
        expr: pg_tide_pipeline_healthy == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pipeline {{ $labels.pipeline }} is unhealthy (circuit breaker open)"

      - alert: PgTideNoActivity
        expr: rate(pg_tide_messages_published_total[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline {{ $labels.pipeline }} has published zero messages for 15 minutes"

Dashboard Panel

# Traffic light: 1 = green, 0 = red
pg_tide_pipeline_healthy
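
For a single traffic light across all pipelines, taking the minimum of the gauge goes red as soon as any one pipeline is unhealthy:

# Overall status: 0 if any pipeline is unhealthy, 1 if all are healthy
min(pg_tide_pipeline_healthy)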

Recipe: Throughput Monitoring

Goal: Understand message flow rates and detect anomalies.

Key Queries

# Messages published per second (per pipeline)
rate(pg_tide_messages_published_total[5m])

# Total throughput across all pipelines
sum(rate(pg_tide_messages_published_total[5m]))

# Publish success ratio
1 - (rate(pg_tide_publish_errors_total[5m]) / rate(pg_tide_messages_consumed_total[5m]))
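
If these rates back several dashboards, recording rules keep panels fast and the expressions consistent. A sketch; the rule names follow the common level:metric:operation convention and are suggestions, not names shipped with pg_tide:

groups:
  - name: pg-tide-throughput-recording
    rules:
      # Per-pipeline publish rate
      - record: pipeline:pg_tide_messages_published:rate5m
        expr: rate(pg_tide_messages_published_total[5m])
      # Fleet-wide publish rate
      - record: job:pg_tide_messages_published:rate5m
        expr: sum(rate(pg_tide_messages_published_total[5m]))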

Alert: Throughput Drop

- alert: PgTideThroughputDrop
  expr: |
    rate(pg_tide_messages_published_total[5m]) 
    < 0.5 * rate(pg_tide_messages_published_total[1h] offset 1d)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} throughput dropped >50% vs yesterday"

Recipe: Latency Monitoring

Goal: Ensure messages are delivered within acceptable time bounds.

Key Queries

# P50 delivery latency
histogram_quantile(0.5, rate(pg_tide_delivery_latency_seconds_bucket[5m]))

# P99 delivery latency
histogram_quantile(0.99, rate(pg_tide_delivery_latency_seconds_bucket[5m]))

# Fraction of messages delivered within 1 second (0.0 - 1.0)
sum(rate(pg_tide_delivery_latency_seconds_bucket{le="1.0"}[5m]))
/ sum(rate(pg_tide_delivery_latency_seconds_count[5m]))

Alert: High Latency

- alert: PgTideHighLatency
  expr: histogram_quantile(0.99, rate(pg_tide_delivery_latency_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} P99 latency exceeds 5 seconds"

Recipe: Consumer Lag Monitoring

Goal: Detect growing backlogs before they become critical.

Key Queries

# Current lag (pending messages)
pg_tide_consumer_lag

# Lag growth rate (positive = growing, negative = draining)
deriv(pg_tide_consumer_lag[5m])

# Estimated seconds to drain the backlog at the current publish rate
pg_tide_consumer_lag / rate(pg_tide_messages_published_total[5m])

Alert: Growing Lag

- alert: PgTideGrowingLag
  expr: pg_tide_consumer_lag > 10000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has {{ $value }} pending messages"

- alert: PgTideCriticalLag
  expr: pg_tide_consumer_lag > 100000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has critical backlog: {{ $value }} messages"

Recipe: Error Rate Monitoring

Goal: Detect delivery problems early.

Key Queries

# Errors per second
rate(pg_tide_publish_errors_total[5m])

# Error ratio (errors / total consumed)
rate(pg_tide_publish_errors_total[5m]) / rate(pg_tide_messages_consumed_total[5m])

Alert: Error Spike

- alert: PgTideErrorSpike
  expr: rate(pg_tide_publish_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has sustained errors: {{ $value }}/s"

Recipe: Dead Letter Queue Monitoring

Goal: Track messages that failed permanently and need attention.

SQL Query (for custom exporter or pg_stat_monitor)

-- Unresolved DLQ entries by pipeline
SELECT pipeline_name, count(*) as unresolved
FROM tide.relay_dlq
WHERE resolved_at IS NULL
GROUP BY pipeline_name;
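
The pg_tide_dlq_unresolved metric used in the alert below is not built in. One way to expose it is a custom query file for postgres_exporter; this is a sketch of that approach (the namespace and column aliases are chosen so the resulting metric name and pipeline label match the alert, and are an assumption about your exporter setup):

pg_tide_dlq:
  query: |
    SELECT pipeline_name AS pipeline, count(*) AS unresolved
    FROM tide.relay_dlq
    WHERE resolved_at IS NULL
    GROUP BY pipeline_name
  metrics:
    - pipeline:
        usage: "LABEL"
        description: "Pipeline name"
    - unresolved:
        usage: "GAUGE"
        description: "Unresolved DLQ entries"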

Alert (via SQL-based exporter)

- alert: PgTideDLQGrowing
  expr: pg_tide_dlq_unresolved > 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} has {{ $value }} unresolved DLQ entries"

Recipe: Resource Monitoring

Goal: Ensure relay processes have adequate resources.

Key Queries (standard node/container metrics)

# CPU usage per relay pod
rate(container_cpu_usage_seconds_total{container="pg-tide"}[5m])

# Memory usage per relay pod
container_memory_working_set_bytes{container="pg-tide"}

# PostgreSQL active connections from relay
pg_stat_activity_count{application_name="pg-tide"}
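
These container metrics can also feed a saturation alert. A sketch for memory pressure; the alert name and 90% threshold are illustrative, and it assumes cAdvisor/kubelet metrics with a memory limit set on the container:

- alert: PgTideHighMemoryUsage   # hypothetical name
  expr: |
    container_memory_working_set_bytes{container="pg-tide"}
    / container_spec_memory_limit_bytes{container="pg-tide"} > 0.9   # requires a memory limit; 0 limit yields no usable ratio
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "pg-tide container is using more than 90% of its memory limit"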

Runbook Reference

| Alert | First Response | Escalation |
| --- | --- | --- |
| PipelineUnhealthy | Check sink availability, review error logs | Restart relay if stuck |
| ThroughputDrop | Check source (outbox empty?), check sink (slow?) | Scale relay instances |
| HighLatency | Check batch size, check sink response time | Increase batch size or add instances |
| GrowingLag | Check relay health, check for slow transforms | Increase batch size, add instances |
| ErrorSpike | Check DLQ for error details, check sink logs | Fix root cause, replay DLQ |
| DLQGrowing | Inspect DLQ entries, identify error pattern | Fix issue, replay messages |

Further Reading