Error Handling

The pg-tide relay is designed to be resilient in the face of failures. This page covers how errors are categorized and handled at each stage of the pipeline, the retry strategy, graceful shutdown behavior, dead-letter queue management, and a complete reference of all error codes.


Error Philosophy

pg_tide's error handling follows a simple principle: transient errors are retried, permanent errors are logged and skipped. The relay never silently drops messages — every error is logged, counted in Prometheus metrics, and (for inbox-side failures) tracked in the dead-letter queue.

The relay distinguishes between:

  • Transient errors — network timeouts, temporary unavailability, connection resets. These will succeed if retried.
  • Permanent errors — malformed payloads, deserialization failures, invalid configuration. These will never succeed regardless of retries.

Retry Strategy

All transient errors trigger exponential backoff retry with jitter:

  • Initial delay: 100ms. Start retrying quickly for brief hiccups.
  • Maximum delay: 30 seconds. Cap the backoff to avoid minute-long waits.
  • Jitter: ±20%. Prevents a thundering herd when multiple relays reconnect simultaneously.
  • Maximum retries: unlimited. The relay retries transient errors forever; messages are never lost.
  • Backoff multiplier: 2×. Each retry doubles the delay (100ms → 200ms → 400ms → ...).

The backoff sequence looks like: 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s, 30s, 30s, 30s...

Jitter randomizes each delay by ±20%, so the actual sequence might be: 85ms, 220ms, 350ms, 900ms, etc. This prevents synchronized retry storms.
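
The schedule is easy to reproduce for illustration. The following SQL sketch mirrors the parameters above (100ms initial delay, 2× multiplier, 30s cap, ±20% jitter); it models the algorithm and is not relay code:

-- Sketch: the first 12 retry delays with a 2x multiplier, 30s cap, and ±20% jitter
SELECT n AS attempt,
       least(100 * 2 ^ (n - 1), 30000) AS base_delay_ms,
       round(least(100 * 2 ^ (n - 1), 30000) * (0.8 + random() * 0.4))::int AS jittered_delay_ms
FROM generate_series(1, 12) AS n;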


Error Categories

Connection errors (PostgreSQL)

Symptoms: Relay logs "PostgreSQL connection failed, retrying" or "postgres error"

What happens:

  1. The relay logs a warning with connection details
  2. Enters reconnection mode with exponential backoff (100ms → 30s)
  3. All pipelines are paused (they can't function without the database)
  4. On reconnect, advisory locks are re-acquired
  5. Pipeline processing resumes from the last committed offset
  6. No messages are lost — they remain pending in the outbox

Common causes:

  • PostgreSQL is restarting or failing over
  • Network partition between relay and database
  • Connection pool exhaustion
  • Authentication failure (password rotation)

Resolution: Usually self-healing. The relay reconnects automatically when PostgreSQL is available again. If the issue is persistent (auth failure), fix the credentials and the relay will reconnect on its next attempt.
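
While the relay is reconnecting, you can confirm from the database side that messages are accumulating rather than disappearing. A quick check, using a hypothetical outbox named 'orders' (the exact columns returned may vary by version):

-- Pending messages remain visible in the outbox while the relay is down
SELECT * FROM tide.outbox_status('orders');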

Sink errors (delivery failures)

Symptoms: Relay logs "sink publish error" or "sink unhealthy", Prometheus counter pg_tide_relay_publish_errors_total increases.

What happens:

  1. Messages remain pending in the outbox (they are never lost)
  2. The relay retries delivery with exponential backoff until the sink recovers
  3. Prometheus metrics track pg_tide_relay_publish_errors_total{pipeline="..."}
  4. The health endpoint reports unhealthy (503) for affected pipelines
  5. Once the sink recovers, delivery resumes automatically

Common causes:

  • Downstream system (Kafka, NATS, webhook endpoint) is temporarily unavailable
  • Network issues between relay and sink
  • Sink is overloaded and rejecting new messages (backpressure)
  • TLS certificate issues

Resolution: Usually self-healing. Monitor the error rate and investigate if it persists beyond expected maintenance windows.

Source errors (reverse mode)

Symptoms: Relay logs "source poll error" for reverse pipelines.

What happens:

  1. The relay retries subscription/polling with exponential backoff
  2. Once reconnected, consumption resumes from the last acknowledged position
  3. No messages are skipped (the source tracks its own offset)

Common causes:

  • External source (NATS, Kafka, SQS) is temporarily unavailable
  • Subscription expired or was revoked
  • Consumer group rebalancing (Kafka)

Payload errors (permanent)

Symptoms: Relay logs "payload decode error" or "unsupported outbox payload version"

What happens:

  1. The error is logged with full context (outbox name, message ID, raw payload excerpt)
  2. The message is skipped — it will never succeed regardless of retries
  3. The offset advances past the bad message
  4. Prometheus tracks the error count

Common causes:

  • Application published malformed JSONB that the relay cannot interpret
  • Message format version mismatch (relay expects v2, message is v1)
  • Corruption (extremely rare)

Resolution: Investigate the specific message. Fix the publishing code if it's generating invalid payloads. For format mismatches, upgrade the relay or add backward-compatible handling.
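
When investigating, it helps to look at the raw payload of the offending message. The query below is a hedged sketch: it assumes outbox tables follow the same tide."<name>_outbox" naming convention as inboxes, and the id and payload column names are hypothetical:

-- Hypothetical: inspect the raw payload of a bad message by its ID
-- (table and column names are assumptions; adjust to your schema)
SELECT id, payload
FROM tide."orders_outbox"
WHERE id = 12345;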

Configuration errors

Symptoms: Relay logs "config error" or "invalid config for pipeline" at startup or after hot-reload.

What happens:

  1. If the error is in the TOML file, the relay refuses to start
  2. If the error is in a pipeline config (in PostgreSQL), that specific pipeline is skipped
  3. Other pipelines continue to operate normally

Common causes:

  • Missing required config key (e.g., no brokers for Kafka sink)
  • Invalid value (e.g., non-numeric batch_size)
  • Unsupported backend name

Resolution: Fix the configuration. For pipeline configs, update the JSONB in the database and the relay will pick up the correction via hot-reload.
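
To see exactly what the relay is reading for a pipeline, you can fetch its stored config with relay_get_config (listed in the error reference below); this sketch assumes it takes the pipeline name and returns the stored JSONB:

-- Inspect the stored config for a pipeline before and after fixing it
SELECT tide.relay_get_config('orders-to-kafka');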


Graceful Shutdown

When the relay receives SIGTERM or SIGINT:

  1. Stop accepting new work — no new batches are fetched from the outbox
  2. Drain in-flight messages — wait for currently-delivering batches to complete (up to a drain timeout)
  3. Commit final offsets — record the last successfully delivered position
  4. Release advisory locks — allow other relay instances to take over immediately
  5. Close connections — cleanly disconnect from PostgreSQL and sinks
  6. Exit with code 0 — signal success to the process manager

The drain timeout prevents the relay from hanging indefinitely if a sink is unresponsive during shutdown. Messages that weren't committed will be re-delivered by the next relay instance (and deduplicated by the inbox if applicable).
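
To confirm that a stopped relay released its advisory locks (step 4), you can inspect PostgreSQL's pg_locks view; a cleanly shut-down relay should hold none:

-- Advisory locks currently held; should not include the stopped relay's PID
SELECT pid, classid, objid, granted
FROM pg_locks
WHERE locktype = 'advisory';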


Dead-Letter Queue (Inbox Side)

For reverse pipelines that write to inboxes, messages that fail processing are managed through the inbox's built-in DLQ mechanism.

How messages enter the DLQ

  1. Your application reads a message from the inbox and attempts to process it
  2. Processing fails (external API timeout, validation error, business rule violation)
  3. You call tide.inbox_mark_failed(inbox_name, event_id, error_message) (see the consumer sketch after this list)
  4. The message's retry_count is incremented and last_error is recorded
  5. After max_retries failures, the message is effectively dead-lettered
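
Put together, a consumer loop looks roughly like the PL/pgSQL sketch below. The inbox_mark_failed call is from step 3 above; the inbox_mark_processed signature and the my_app.process_event function are assumptions for illustration:

DO $$
DECLARE
  msg record;
BEGIN
  -- Claim the oldest unprocessed message (columns as shown in the DLQ queries below)
  SELECT event_id, payload INTO msg
  FROM tide."my-inbox_inbox"
  WHERE processed_at IS NULL
  ORDER BY received_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED;

  IF FOUND THEN
    BEGIN
      PERFORM my_app.process_event(msg.payload);                    -- hypothetical handler
      PERFORM tide.inbox_mark_processed('my-inbox', msg.event_id);  -- assumed signature
    EXCEPTION WHEN OTHERS THEN
      PERFORM tide.inbox_mark_failed('my-inbox', msg.event_id, SQLERRM);
    END;
  END IF;
END $$;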

Querying the DLQ

-- Find all dead-lettered messages in an inbox
SELECT event_id, payload, retry_count, last_error, received_at
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL
  AND retry_count >= 5  -- assuming max_retries = 5
ORDER BY received_at;

Investigating failures

-- Group DLQ messages by error pattern
SELECT
  left(last_error, 50) AS error_pattern,
  count(*) AS message_count,
  min(received_at) AS earliest,
  max(received_at) AS latest
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL AND retry_count >= 5
GROUP BY left(last_error, 50)
ORDER BY message_count DESC;

Replaying messages

After fixing the underlying issue, replay specific messages or all DLQ messages:

-- Replay specific messages
SELECT tide.replay_inbox_messages('my-inbox',
  ARRAY['evt-001', 'evt-002', 'evt-003']);

-- Replay all DLQ messages for an inbox
SELECT tide.replay_inbox_messages('my-inbox',
  (SELECT array_agg(event_id)
   FROM tide."my-inbox_inbox"
   WHERE processed_at IS NULL AND retry_count >= 5)
);

Replaying resets retry_count to 0, making messages eligible for normal processing again.
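
You can verify the reset immediately afterwards; replayed messages should show retry_count = 0 while remaining unprocessed:

-- Replayed messages become eligible again: retry_count = 0, still unprocessed
SELECT count(*)
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL AND retry_count = 0;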


Extension Error Reference

Errors raised by pg_tide SQL functions:

  • outbox already exists: {name} (raised by outbox_create): An outbox with this name already exists. Pass p_if_not_exists := true to suppress the error.
  • outbox not found: {name} (raised by outbox_publish, outbox_drop, outbox_status, outbox_enable/disable): No outbox with this name exists. Create it first with outbox_create.
  • inbox already exists: {name} (raised by inbox_create): An inbox with this name already exists.
  • inbox not found: {name} (raised by inbox_drop, inbox_mark_processed/failed, inbox_status): No inbox with this name exists.
  • relay pipeline not found: {name} (raised by relay_enable/disable/delete/get_config): No pipeline with this name exists in the catalog.
  • invalid argument: {details} (raised by various functions): A parameter value is invalid (e.g., negative retention_hours).
  • SPI error: {details} (raised by various functions): An internal database error occurred during SPI execution.

Handling extension errors in PL/pgSQL

DO $$
BEGIN
  PERFORM tide.outbox_publish('maybe-missing', '{}'::jsonb, '{}'::jsonb);
EXCEPTION
  WHEN OTHERS THEN
    RAISE NOTICE 'Publish failed: %', SQLERRM;
    -- Handle gracefully: log, retry, or use a fallback
END $$;

Relay Error Reference

Errors logged by the pg-tide relay binary:

  • postgres error (Connection): Database communication failure. Self-healing; the relay reconnects.
  • postgres connection failed (Connection): Cannot reach PostgreSQL. Self-healing; the relay retries.
  • config error (Configuration): Invalid TOML or a missing field. Not self-healing; fix the config file.
  • invalid config for pipeline (Configuration): Pipeline JSONB failed validation. Not self-healing; fix the SQL config.
  • pipeline not found (Configuration): The referenced pipeline doesn't exist. Not self-healing; create the pipeline.
  • missing required config key (Configuration): A required backend config key is missing. Not self-healing; fix the SQL config.
  • unsupported outbox payload version (Payload): Message format version mismatch. Not self-healing; upgrade the relay or fix the publisher.
  • payload decode error (Payload): Cannot deserialize the message. Not self-healing; fix the publisher.
  • sink publish error (Delivery): The sink rejected the message or timed out. Self-healing; the relay retries.
  • sink unhealthy (Delivery): The sink is not accepting connections. Self-healing; the relay retries.
  • source poll error (Ingestion): Source read failure. Self-healing; the relay retries.
  • channel closed (Internal): An internal communication channel dropped. Self-healing; the relay recovers.

Monitoring Errors

Prometheus metrics for error tracking

# Delivery error rate by pipeline (should be 0 in steady state)
rate(pg_tide_relay_publish_errors_total[5m])

# Unhealthy pipelines (immediate alert)
pg_tide_relay_pipeline_healthy == 0

# Error rate as a percentage of total deliveries
rate(pg_tide_relay_publish_errors_total[5m])
  / rate(pg_tide_relay_messages_published_total[5m])

Alerting rules

- alert: PgTideDeliveryErrors
  expr: rate(pg_tide_relay_publish_errors_total[5m]) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Delivery errors on pipeline {{ $labels.pipeline }}"
    description: "The relay is experiencing delivery failures. Check sink availability."

- alert: PgTidePipelineDown
  expr: pg_tide_relay_pipeline_healthy == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} is unhealthy"
    description: "Immediate investigation required. Messages are accumulating."

Log-based monitoring

With structured JSON logging (--log-format json), you can filter and alert on error logs:

{"level":"error","pipeline":"orders-to-kafka","error":"sink publish error: BrokerNotAvailable","msg":"delivery failed, will retry","timestamp":"2025-01-15T10:30:00Z"}

Key fields to monitor:

  • level=error — any error-level log indicates a problem
  • pipeline — identifies which pipeline is affected
  • error — the specific error message for diagnosis