Error Handling
The pg-tide relay is designed to be resilient in the face of failures. This page covers how errors are categorized and handled at each stage of the pipeline, the retry strategy, graceful shutdown behavior, dead-letter queue management, and a reference of extension and relay error messages.
Error Philosophy
pg_tide's error handling follows a simple principle: transient errors are retried, permanent errors are logged and skipped. The relay never silently drops messages — every error is logged, counted in Prometheus metrics, and (for inbox-side failures) tracked in the dead-letter queue.
The relay distinguishes between:
- Transient errors — network timeouts, temporary unavailability, connection resets. These will succeed if retried.
- Permanent errors — malformed payloads, deserialization failures, invalid configuration. These will never succeed regardless of retries.
Retry Strategy
All transient errors trigger exponential backoff retry with jitter:
| Parameter | Value | Purpose |
|---|---|---|
| Initial delay | 100ms | Start retrying quickly for brief hiccups |
| Maximum delay | 30 seconds | Cap the backoff to avoid minute-long waits |
| Jitter | ±20% | Prevent thundering herd when multiple relays reconnect simultaneously |
| Maximum retries | Unlimited | The relay retries forever for transient errors — messages are never lost |
| Backoff multiplier | 2× | Each retry doubles the delay (100ms → 200ms → 400ms → ...) |
The backoff sequence looks like: 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s, 30s, 30s, 30s...
Jitter randomizes each delay by ±20%, so the actual sequence might be: 85ms, 220ms, 350ms, 900ms, etc. This prevents synchronized retry storms.
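The schedule above can be sketched as a small helper. This is an illustrative sketch of the documented parameters, not the relay's actual implementation (which is internal to the binary):

```python
import random

# Parameters from the retry strategy table above
INITIAL_DELAY = 0.1   # 100ms
MAX_DELAY = 30.0      # 30s cap
MULTIPLIER = 2.0      # each retry doubles the delay
JITTER = 0.2          # ±20%

def backoff_delay(attempt: int) -> float:
    """Delay in seconds before retry number `attempt` (0-based), with jitter."""
    base = min(INITIAL_DELAY * MULTIPLIER ** attempt, MAX_DELAY)
    # Randomize within ±20% to avoid synchronized retry storms
    return base * random.uniform(1 - JITTER, 1 + JITTER)

# The un-jittered sequence: 100ms, 200ms, 400ms, ... capped at 30s
print([round(min(INITIAL_DELAY * MULTIPLIER ** n, MAX_DELAY), 1) for n in range(10)])
# → [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 30.0]
```

Because retries are unlimited for transient errors, `backoff_delay` would simply keep returning values near the 30-second cap once the sequence saturates.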
Error Categories
Connection errors (PostgreSQL)
Symptoms: Relay logs "PostgreSQL connection failed, retrying" or "postgres error"
What happens:
- The relay logs a warning with connection details
- Enters reconnection mode with exponential backoff (100ms → 30s)
- All pipelines are paused (they can't function without the database)
- On reconnect, advisory locks are re-acquired
- Pipeline processing resumes from the last committed offset
- No messages are lost — they remain pending in the outbox
Common causes:
- PostgreSQL is restarting or failing over
- Network partition between relay and database
- Connection pool exhaustion
- Authentication failure (password rotation)
Resolution: Usually self-healing. The relay reconnects automatically when PostgreSQL is available again. If the issue is persistent (auth failure), fix the credentials and the relay will reconnect on its next attempt.
Sink errors (delivery failures)
Symptoms: Relay logs "sink publish error" or "sink unhealthy", Prometheus counter pg_tide_relay_publish_errors_total increases.
What happens:
- Messages remain pending in the outbox (they are never lost)
- The relay retries delivery with exponential backoff until the sink recovers
- Prometheus metrics track `pg_tide_relay_publish_errors_total{pipeline="..."}`
- The health endpoint reports unhealthy (503) for affected pipelines
- Once the sink recovers, delivery resumes automatically
Common causes:
- Downstream system (Kafka, NATS, webhook endpoint) is temporarily unavailable
- Network issues between relay and sink
- Sink is overloaded and rejecting new messages (backpressure)
- TLS certificate issues
Resolution: Usually self-healing. Monitor the error rate and investigate if it persists beyond expected maintenance windows.
Source errors (reverse mode)
Symptoms: Relay logs "source poll error" for reverse pipelines.
What happens:
- The relay retries subscription/polling with exponential backoff
- Once reconnected, consumption resumes from the last acknowledged position
- No messages are skipped (the source tracks its own offset)
Common causes:
- External source (NATS, Kafka, SQS) is temporarily unavailable
- Subscription expired or was revoked
- Consumer group rebalancing (Kafka)
Payload errors (permanent)
Symptoms: Relay logs "payload decode error" or "unsupported outbox payload version"
What happens:
- The error is logged with full context (outbox name, message ID, raw payload excerpt)
- The message is skipped — it will never succeed regardless of retries
- The offset advances past the bad message
- Prometheus tracks the error count
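The skip-and-advance behavior can be illustrated with a simplified processing loop. This is a sketch with hypothetical message and error types, not the relay's internal code:

```python
class TransientError(Exception):
    """Network timeout, reset, unavailability: retried forever with backoff."""

class PermanentError(Exception):
    """Malformed payload, bad version: logged and skipped."""

def process_batch(messages, deliver, log):
    """Deliver messages in order; returns how far the offset advanced.

    Permanent failures are logged and skipped (the offset still advances).
    A transient failure stops the batch so the caller can retry from the
    same position with backoff.
    """
    offset = 0
    for msg in messages:
        try:
            deliver(msg)
        except PermanentError as e:
            log(f"payload error, skipping message {msg['id']}: {e}")
            # fall through: the offset advances past the bad message
        except TransientError:
            # leave the offset at the last good message; caller retries
            return offset
        offset += 1
    return offset
```

The key property shown here is the asymmetry: a permanent error costs one message, while a transient error costs no messages at all.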
Common causes:
- Application published malformed JSONB that the relay cannot interpret
- Message format version mismatch (relay expects v2, message is v1)
- Corruption (extremely rare)
Resolution: Investigate the specific message. Fix the publishing code if it's generating invalid payloads. For format mismatches, upgrade the relay or add backward-compatible handling.
Configuration errors
Symptoms: Relay logs "config error" or "invalid config for pipeline" at startup or after hot-reload.
What happens:
- If the error is in the TOML file, the relay refuses to start
- If the error is in a pipeline config (in PostgreSQL), that specific pipeline is skipped
- Other pipelines continue to operate normally
Common causes:
- Missing required config key (e.g., no `brokers` for Kafka sink)
- Invalid value (e.g., non-numeric `batch_size`)
- Unsupported backend name
Resolution: Fix the configuration. For pipeline configs, update the JSONB in the database and the relay will pick up the correction via hot-reload.
Graceful Shutdown
When the relay receives SIGTERM or SIGINT:
- Stop accepting new work — no new batches are fetched from the outbox
- Drain in-flight messages — wait for currently-delivering batches to complete (up to a drain timeout)
- Commit final offsets — record the last successfully delivered position
- Release advisory locks — allow other relay instances to take over immediately
- Close connections — cleanly disconnect from PostgreSQL and sinks
- Exit with code 0 — signal success to the process manager
The drain timeout prevents the relay from hanging indefinitely if a sink is unresponsive during shutdown. Messages that weren't committed will be re-delivered by the next relay instance (and deduplicated by the inbox if applicable).
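The drain-with-timeout step can be sketched as follows. The names here are hypothetical; the relay's actual shutdown code is internal:

```python
import time

def drain(in_flight, drain_timeout: float, now=time.monotonic) -> bool:
    """Wait for in-flight batches to finish, up to drain_timeout seconds.

    `in_flight` is any callable returning the number of undelivered batches.
    Returns True if everything drained, False if the timeout expired;
    in the latter case, uncommitted messages are simply re-delivered by
    the next relay instance (and deduplicated by the inbox if applicable).
    """
    deadline = now() + drain_timeout
    while in_flight() > 0:
        if now() >= deadline:
            return False  # sink unresponsive; don't hang the shutdown
        time.sleep(0.05)
    return True
```

Returning `False` is not an error path: the offsets were never committed, so at-least-once delivery is preserved either way.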
Dead-Letter Queue (Inbox Side)
For reverse pipelines that write to inboxes, messages that fail processing are managed through the inbox's built-in DLQ mechanism.
How messages enter the DLQ
- Your application reads a message from the inbox and attempts to process it
- Processing fails (external API timeout, validation error, business rule violation)
- You call
tide.inbox_mark_failed(inbox_name, event_id, error_message) - The message's
retry_countis incremented andlast_erroris recorded - After
max_retriesfailures, the message is effectively dead-lettered
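The steps above can be sketched as an application-side worker loop. This sketch assumes a DB-API connection (e.g., psycopg), the `tide.inbox_mark_failed` call shown above, and a `tide.inbox_mark_processed(inbox_name, event_id)` counterpart with an assumed two-argument signature; `process_event` stands in for your business logic:

```python
def drain_inbox(conn, inbox_name, process_event, batch_size=100):
    """Fetch unprocessed inbox rows, process each, and record the outcome."""
    with conn.cursor() as cur:
        # Per-inbox table naming as shown in the DLQ queries below
        cur.execute(
            f'SELECT event_id, payload FROM tide."{inbox_name}_inbox" '
            "WHERE processed_at IS NULL ORDER BY received_at LIMIT %s",
            (batch_size,),
        )
        for event_id, payload in cur.fetchall():
            try:
                process_event(payload)
                cur.execute("SELECT tide.inbox_mark_processed(%s, %s)",
                            (inbox_name, event_id))
            except Exception as exc:
                # Increments retry_count and records last_error; after
                # max_retries failures the message is dead-lettered
                cur.execute("SELECT tide.inbox_mark_failed(%s, %s, %s)",
                            (inbox_name, event_id, str(exc)))
    conn.commit()
```

Catching a broad `Exception` is deliberate here: any processing failure should land in `inbox_mark_failed` rather than abort the batch.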
Querying the DLQ
-- Find all dead-lettered messages in an inbox
SELECT event_id, payload, retry_count, last_error, received_at
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL
AND retry_count >= 5 -- assuming max_retries = 5
ORDER BY received_at;
Investigating failures
-- Group DLQ messages by error pattern
SELECT
left(last_error, 50) AS error_pattern,
count(*) AS message_count,
min(received_at) AS earliest,
max(received_at) AS latest
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL AND retry_count >= 5
GROUP BY left(last_error, 50)
ORDER BY message_count DESC;
Replaying messages
After fixing the underlying issue, replay specific messages or all DLQ messages:
-- Replay specific messages
SELECT tide.replay_inbox_messages('my-inbox',
ARRAY['evt-001', 'evt-002', 'evt-003']);
-- Replay all DLQ messages for an inbox
SELECT tide.replay_inbox_messages('my-inbox',
(SELECT array_agg(event_id)
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL AND retry_count >= 5)
);
Replaying resets retry_count to 0, making messages eligible for normal processing again.
Extension Error Reference
Errors raised by pg_tide SQL functions:
| Error message | Raised by | What it means |
|---|---|---|
| outbox already exists: {name} | outbox_create | An outbox with this name already exists. Use p_if_not_exists := true to suppress. |
| outbox not found: {name} | outbox_publish, outbox_drop, outbox_status, outbox_enable/disable | No outbox with this name exists. Create it first with outbox_create. |
| inbox already exists: {name} | inbox_create | An inbox with this name already exists. |
| inbox not found: {name} | inbox_drop, inbox_mark_processed/failed, inbox_status | No inbox with this name exists. |
| relay pipeline not found: {name} | relay_enable/disable/delete/get_config | No pipeline with this name in the catalog. |
| invalid argument: {details} | Various | A parameter value is invalid (e.g., negative retention_hours). |
| SPI error: {details} | Various | Internal database error during SPI execution. |
Handling extension errors in PL/pgSQL
DO $$
BEGIN
PERFORM tide.outbox_publish('maybe-missing', '{}'::jsonb, '{}'::jsonb);
EXCEPTION
WHEN OTHERS THEN
RAISE NOTICE 'Publish failed: %', SQLERRM;
-- Handle gracefully: log, retry, or use a fallback
END $$;
Relay Error Reference
Errors logged by the pg-tide relay binary:
| Error | Category | What it means | Self-healing? |
|---|---|---|---|
| postgres error | Connection | Database communication failure | ✓ (reconnects) |
| postgres connection failed | Connection | Cannot reach PostgreSQL | ✓ (retries) |
| config error | Configuration | Invalid TOML or missing field | ✗ (fix config) |
| invalid config for pipeline | Configuration | Pipeline JSONB validation failure | ✗ (fix SQL config) |
| pipeline not found | Configuration | Referenced pipeline doesn't exist | ✗ (create pipeline) |
| missing required config key | Configuration | A required backend config key is missing | ✗ (fix SQL config) |
| unsupported outbox payload version | Payload | Message format version mismatch | ✗ (upgrade relay or fix publisher) |
| payload decode error | Payload | Cannot deserialize message | ✗ (fix publisher) |
| sink publish error | Delivery | Sink rejected or timed out | ✓ (retries) |
| sink unhealthy | Delivery | Sink not accepting connections | ✓ (retries) |
| source poll error | Ingestion | Source read failure | ✓ (retries) |
| channel closed | Internal | Internal communication channel dropped | ✓ (relay recovers) |
Monitoring Errors
Prometheus metrics for error tracking
# Total delivery errors by pipeline (should be 0 in steady state)
rate(pg_tide_relay_publish_errors_total[5m])
# Unhealthy pipelines (immediate alert)
pg_tide_relay_pipeline_healthy == 0
# Error rate as a percentage of total deliveries
rate(pg_tide_relay_publish_errors_total[5m])
/ rate(pg_tide_relay_messages_published_total[5m])
Alerting rules
- alert: PgTideDeliveryErrors
expr: rate(pg_tide_relay_publish_errors_total[5m]) > 0
for: 2m
labels:
severity: warning
annotations:
summary: "Delivery errors on pipeline {{ $labels.pipeline }}"
description: "The relay is experiencing delivery failures. Check sink availability."
- alert: PgTidePipelineDown
expr: pg_tide_relay_pipeline_healthy == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Pipeline {{ $labels.pipeline }} is unhealthy"
description: "Immediate investigation required. Messages are accumulating."
Log-based monitoring
With structured JSON logging (--log-format json), you can filter and alert on error logs:
{"level":"error","pipeline":"orders-to-kafka","error":"sink publish error: BrokerNotAvailable","msg":"delivery failed, will retry","timestamp":"2025-01-15T10:30:00Z"}
Key fields to monitor:
- `level=error` — any error-level log indicates a problem
- `pipeline` — identifies which pipeline is affected
- `error` — the specific error message for diagnosis
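For example, a minimal filter over a stream of structured log lines (a sketch; the field names match the example line above):

```python
import json

def error_events(lines):
    """Yield (pipeline, error) pairs for error-level JSON log lines."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g., startup banners)
        if event.get("level") == "error":
            yield event.get("pipeline"), event.get("error")
```

In practice you would point a log shipper (Vector, Fluent Bit, etc.) at the same fields rather than writing your own filter; the sketch just shows which keys carry the signal.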