Troubleshooting Guide

A comprehensive guide to diagnosing and resolving common pg_tide issues in production.

Quick Diagnostic Checklist

When something isn't working:

  1. Check relay logs — Look for ERROR or WARN messages
  2. Check /health endpoint — Returns 503 if circuit breaker is open
  3. Check consumer lag — pg_tide_consumer_lag metric, or query the outbox directly (see the sketch after this list)
  4. Check DLQ — SELECT count(*) FROM tide.relay_dlq WHERE resolved_at IS NULL
  5. Check advisory locks — SELECT * FROM pg_locks WHERE locktype = 'advisory'
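
If the pg_tide_consumer_lag metric isn't available, a minimal lag check can query the outbox directly. This sketch assumes an outbox table named tide.relay_outbox with created_at and processed_at columns — adjust the table and column names to your actual schema:

-- Hypothetical table/column names; substitute your own.
SELECT count(*) AS pending,
       min(created_at) AS oldest_pending
FROM tide.relay_outbox
WHERE processed_at IS NULL;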

Messages Not Being Delivered

Symptom: Outbox rows accumulate, nothing reaches the sink

Check 1: Is the relay running?

# Check if the process is alive
ps aux | grep pg-tide

# Check Kubernetes
kubectl get pods -l app=pg-tide-relay

Check 2: Is the pipeline enabled?

SELECT name, enabled FROM tide.relay_outbox_config;
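
If enabled is false for your pipeline, re-enable it (assuming enabled is a plain boolean column you're allowed to update):

UPDATE tide.relay_outbox_config
SET enabled = true
WHERE name = 'your-pipeline';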

Check 3: Does the relay hold the advisory lock?

SELECT pid, objid 
FROM pg_locks 
WHERE locktype = 'advisory' AND granted = true;

If no locks are held, the relay may have lost its database connection.

Check 4: Is the circuit breaker open?

curl http://localhost:9090/health
# If unhealthy, circuit breaker is open for one or more pipelines

Check 5: Is the sink reachable? Test connectivity from the relay host to the sink (Kafka broker, HTTP endpoint, etc.).

Symptom: Messages delivered but arriving slowly

Check 1: Batch size too small? Small batch sizes (1-10) increase per-message overhead. Try increasing to 100+.
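
The exact config key for batch size depends on your deployment; assuming a top-level batch_size key in the pipeline's JSONB config (mirroring the dry_run pattern shown later in this guide), a hedged sketch:

-- batch_size is an assumed key name; verify against your config first.
UPDATE tide.relay_outbox_config
SET config = config || '{"batch_size": 200}'::jsonb
WHERE name = 'your-pipeline';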

Check 2: Rate limiter configured?

SELECT config->'rate_limit' FROM tide.relay_outbox_config WHERE name = 'your-pipeline';
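
If a rate limit is set and you've confirmed it's the bottleneck, the jsonb - operator removes the top-level key:

UPDATE tide.relay_outbox_config
SET config = config - 'rate_limit'
WHERE name = 'your-pipeline';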

Check 3: Sink response time? Check pg_tide_delivery_latency_seconds — if high, the sink is slow to acknowledge.

Circuit Breaker Stuck Open

Symptom: Pipeline unhealthy, /health returns 503

The circuit breaker opens after failure_threshold consecutive failures. It won't close until probe requests succeed.

Resolution:

  1. Check sink availability (is the Kafka broker up? Is the webhook endpoint responding?)
  2. Check relay logs for the specific error message
  3. Fix the underlying issue
  4. Wait for half_open_timeout — the circuit will probe automatically
  5. If the probe succeeds, the circuit closes and normal flow resumes

Force-close (nuclear option): Restart the relay. Circuit breaker state is in-memory and resets on restart.

Duplicate Messages at Sink

Symptom: Consumers see the same message multiple times

pg_tide guarantees at-least-once delivery. Duplicates can occur when:

  1. The relay publishes to the sink but crashes before acknowledging
  2. Network partition causes timeout after successful publish
  3. Replay mode reprocesses an already-delivered range

Resolution:

  • Implement idempotent consumers (dedup on outbox_id or dedup_key)
  • Use the inbox on the receiving side for built-in deduplication
  • For Kafka, note that enable.idempotence=true is a producer-side setting: it prevents duplicates from producer retries within a session, but not from relay crashes or replays, so consumers still need their own deduplication (see the sketch below)
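
A minimal idempotent-consumer sketch in SQL, assuming a receiving-side ledger table keyed on dedup_key (the table and schema here are illustrative, not part of pg_tide):

-- Illustrative processed-message ledger on the consumer side.
CREATE TABLE IF NOT EXISTS app.processed_messages (
    dedup_key    text PRIMARY KEY,
    processed_at timestamptz NOT NULL DEFAULT now()
);

-- The insert is a no-op for a message we've already seen; only run the
-- side effects when a row was actually inserted.
INSERT INTO app.processed_messages (dedup_key)
VALUES ('message-dedup-key')
ON CONFLICT (dedup_key) DO NOTHING;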

DLQ Entries Accumulating

Symptom: tide.relay_dlq table growing

Diagnose the error pattern:

SELECT error_kind, error_message, count(*)
FROM tide.relay_dlq
WHERE resolved_at IS NULL
GROUP BY error_kind, error_message
ORDER BY count(*) DESC
LIMIT 10;

Common causes:

Error Kind           | Typical Cause                        | Fix
---------------------|--------------------------------------|-------------------------------
decode               | Malformed message in outbox          | Fix the producing application
sink_permanent       | Auth failure, schema mismatch        | Update credentials or schema
inbox_permanent      | Constraint violation                 | Check inbox table constraints
max_retries_exceeded | Transient issue that lasted too long | Fix sink, then replay

After fixing: Replay the DLQ entries:

SELECT tide.relay_dlq_retry_all('your-pipeline');
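
Entries that can never succeed (for example, decode errors from a producer bug that has since been fixed) shouldn't be replayed. A hedged sketch for resolving them by hand, assuming resolved_at is writable in your deployment:

UPDATE tide.relay_dlq
SET resolved_at = now()
WHERE resolved_at IS NULL
  AND error_kind = 'decode';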

Connection Issues

Symptom: "connection refused" or "timeout" in logs

PostgreSQL connection:

# Test from relay host
psql "${DATABASE_URL}" -c "SELECT 1"

Common issues:

  • Connection string wrong (check --postgres-url)
  • PostgreSQL max_connections reached
  • Firewall rules blocking relay → database
  • SSL/TLS certificate issues

Sink connection:

  • Check DNS resolution from relay host
  • Check firewall/security group rules
  • Verify credentials haven't expired
  • Check TLS certificate validity

Symptom: "too many connections" from PostgreSQL

Each pipeline worker uses one connection. With 50 pipelines, you need 50+ connections.
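
Before raising limits, see where connections are actually going:

SELECT application_name, state, count(*)
FROM pg_stat_activity
GROUP BY application_name, state
ORDER BY count(*) DESC;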

Resolution:

  • Increase max_connections in PostgreSQL
  • Use PgBouncer in front of PostgreSQL
  • Reduce number of pipelines per relay instance

Transform/Filter Issues

Symptom: Messages being silently dropped

If messages are consumed but never published, a transform filter might be dropping them.

Diagnose:

  1. Check if messages_consumed > messages_published consistently
  2. Check transform configuration:
SELECT config->'transform' FROM tide.relay_outbox_config WHERE name = 'your-pipeline';
  3. Test the filter expression against a sample payload
  4. Temporarily remove the filter to confirm (see the sketch after this list)
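
A hedged way to strip just the filter while keeping the rest of the transform config — this assumes the filter lives at the transform.filter path, so check the output of the query above first:

UPDATE tide.relay_outbox_config
SET config = config #- '{transform,filter}'
WHERE name = 'your-pipeline';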

Symptom: Transform producing unexpected output

Set dry_run: true on the pipeline to see what transforms produce without publishing:

UPDATE tide.relay_outbox_config 
SET config = config || '{"dry_run": true}'::jsonb 
WHERE name = 'your-pipeline';

Check relay logs for the dry-run output, then disable dry-run when satisfied.
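
Disabling it mirrors the enable step:

UPDATE tide.relay_outbox_config
SET config = config || '{"dry_run": false}'::jsonb
WHERE name = 'your-pipeline';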

Advisory Lock Conflicts

Symptom: Pipeline not being picked up by any relay instance

Check lock ownership:

SELECT l.pid, a.application_name, l.objid
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.locktype = 'advisory';

If a stale connection holds the lock (zombie process), terminate it:

SELECT pg_terminate_backend(<pid>);

Performance Degradation

Symptom: Gradual slowdown over time

Check 1: Table bloat

SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'outbox%' OR relname LIKE 'inbox%';

If n_dead_tup is high, autovacuum may be falling behind. See Maintenance.

Check 2: Index bloat

SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname LIKE 'outbox%';
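
If an index has bloated well past its expected size, rebuild it without blocking writes (PostgreSQL 12+; the index name below is a placeholder — use the one reported by the query above):

REINDEX INDEX CONCURRENTLY outbox_created_at_idx;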

Check 3: Disk pressure? High I/O wait indicates the database can't keep up with the workload.

Further Reading