Error Handling

The pg-tide relay is designed to be resilient in the face of failures. This page covers how errors are categorized and handled at each stage of the pipeline, the retry strategy, graceful shutdown behavior, dead-letter queue management, and a complete reference of all error codes.


Error Philosophy

pg_tide's error handling follows a simple principle: transient errors are retried, permanent errors are logged and skipped. The relay never silently drops messages — every error is logged, counted in Prometheus metrics, and (for inbox-side failures) tracked in the dead-letter queue.

The relay distinguishes between:

  • Transient errors — network timeouts, temporary unavailability, connection resets. These will succeed if retried.
  • Permanent errors — malformed payloads, deserialization failures, invalid configuration. These will never succeed regardless of retries.

Retry Strategy

All transient errors trigger exponential backoff retry with jitter:

  • Initial delay: 100ms. Start retrying quickly for brief hiccups.
  • Maximum delay: 30 seconds. Cap the backoff to avoid minute-long waits.
  • Jitter: ±20%. Prevents a thundering herd when multiple relays reconnect simultaneously.
  • Maximum retries: unlimited. The relay retries transient errors forever; messages are never lost.
  • Backoff multiplier: 2×. Each retry doubles the delay (100ms → 200ms → 400ms → ...).

The backoff sequence looks like: 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s, 30s, 30s, 30s...

Jitter randomizes each delay by ±20%, so the actual sequence might be: 85ms, 220ms, 350ms, 900ms, etc. This prevents synchronized retry storms.
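
The schedule is easy to reproduce for illustration. The following SQL sketch mirrors the parameters above (100ms initial delay, 2× multiplier, 30s cap, ±20% jitter); it models the algorithm and is not relay code:

-- Sketch: the first 12 retry delays with a 2x multiplier, 30s cap, and ±20% jitter
SELECT n AS attempt,
       least(100 * 2 ^ (n - 1), 30000) AS base_delay_ms,
       round(least(100 * 2 ^ (n - 1), 30000) * (0.8 + random() * 0.4))::int AS jittered_delay_ms
FROM generate_series(1, 12) AS n;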


Error Categories

Connection errors (PostgreSQL)

Symptoms: Relay logs "PostgreSQL connection failed, retrying" or "postgres error"

What happens:

  1. The relay logs a warning with connection details
  2. Enters reconnection mode with exponential backoff (100ms → 30s)
  3. All pipelines are paused (they can't function without the database)
  4. On reconnect, advisory locks are re-acquired
  5. Pipeline processing resumes from the last committed offset
  6. No messages are lost — they remain pending in the outbox

Common causes:

  • PostgreSQL is restarting or failing over
  • Network partition between relay and database
  • Connection pool exhaustion
  • Authentication failure (password rotation)

Resolution: Usually self-healing. The relay reconnects automatically when PostgreSQL is available again. If the issue is persistent (auth failure), fix the credentials and the relay will reconnect on its next attempt.
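
While the relay is reconnecting, you can confirm from the database side that messages are accumulating rather than disappearing. A quick check, using a hypothetical outbox named 'orders' (the exact columns returned may vary by version):

-- Pending messages remain visible in the outbox while the relay is down
SELECT * FROM tide.outbox_status('orders');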

Sink errors (delivery failures)

Symptoms: Relay logs "sink publish error" or "sink unhealthy", Prometheus counter pg_tide_relay_publish_errors_total increases.

What happens:

  1. Messages remain pending in the outbox (they are never lost)
  2. The relay retries delivery with exponential backoff until the sink recovers
  3. Prometheus metrics track pg_tide_relay_publish_errors_total{pipeline="..."}
  4. The health endpoint reports unhealthy (503) for affected pipelines
  5. Once the sink recovers, delivery resumes automatically

Common causes:

  • Downstream system (Kafka, NATS, webhook endpoint) is temporarily unavailable
  • Network issues between relay and sink
  • Sink is overloaded and rejecting new messages (backpressure)
  • TLS certificate issues

Resolution: Usually self-healing. Monitor the error rate and investigate if it persists beyond expected maintenance windows.

Source errors (reverse mode)

Symptoms: Relay logs "source poll error" for reverse pipelines.

What happens:

  1. The relay retries subscription/polling with exponential backoff
  2. Once reconnected, consumption resumes from the last acknowledged position
  3. No messages are skipped (the source tracks its own offset)

Common causes:

  • External source (NATS, Kafka, SQS) is temporarily unavailable
  • Subscription expired or was revoked
  • Consumer group rebalancing (Kafka)

Payload errors (permanent)

Symptoms: Relay logs "payload decode error" or "unsupported outbox payload version"

What happens:

  1. The error is logged with full context (outbox name, message ID, raw payload excerpt)
  2. The message is skipped — it will never succeed regardless of retries
  3. The offset advances past the bad message
  4. Prometheus tracks the error count

Common causes:

  • Application published malformed JSONB that the relay cannot interpret
  • Message format version mismatch (relay expects v2, message is v1)
  • Corruption (extremely rare)

Resolution: Investigate the specific message. Fix the publishing code if it's generating invalid payloads. For format mismatches, upgrade the relay or add backward-compatible handling.
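
When investigating, it helps to look at the raw payload of the offending message. The query below is a hedged sketch: it assumes outbox tables follow the same tide."<name>_outbox" naming convention as inboxes, and the id and payload column names are hypothetical:

-- Hypothetical: inspect the raw payload of a bad message by its ID
-- (table and column names are assumptions; adjust to your schema)
SELECT id, payload
FROM tide."orders_outbox"
WHERE id = 12345;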

Configuration errors

Symptoms: Relay logs "config error" or "invalid config for pipeline" at startup or after hot-reload.

What happens:

  1. If the error is in the TOML file, the relay refuses to start
  2. If the error is in a pipeline config (in PostgreSQL), that specific pipeline is skipped
  3. Other pipelines continue to operate normally

Common causes:

  • Missing required config key (e.g., no brokers for Kafka sink)
  • Invalid value (e.g., non-numeric batch_size)
  • Unsupported backend name

Resolution: Fix the configuration. For pipeline configs, update the JSONB in the database and the relay will pick up the correction via hot-reload.
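
To see exactly what the relay is reading for a pipeline, you can fetch its stored config with relay_get_config (listed in the error reference below); this sketch assumes it takes the pipeline name and returns the stored JSONB:

-- Inspect the stored config for a pipeline before and after fixing it
SELECT tide.relay_get_config('orders-to-kafka');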


Graceful Shutdown

When the relay receives SIGTERM or SIGINT:

  1. Stop accepting new work — no new batches are fetched from the outbox
  2. Drain in-flight messages — wait for currently-delivering batches to complete (up to a drain timeout)
  3. Commit final offsets — record the last successfully delivered position
  4. Release advisory locks — allow other relay instances to take over immediately
  5. Close connections — cleanly disconnect from PostgreSQL and sinks
  6. Exit with code 0 — signal success to the process manager

The drain timeout prevents the relay from hanging indefinitely if a sink is unresponsive during shutdown. Messages that weren't committed will be re-delivered by the next relay instance (and deduplicated by the inbox if applicable).
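
To confirm that a stopped relay released its advisory locks (step 4), you can inspect PostgreSQL's pg_locks view; a cleanly shut-down relay should hold none:

-- Advisory locks currently held; should not include the stopped relay's PID
SELECT pid, classid, objid, granted
FROM pg_locks
WHERE locktype = 'advisory';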


Dead-Letter Queue (Inbox Side)

For reverse pipelines that write to inboxes, messages that fail processing are managed through the inbox's built-in DLQ mechanism.

How messages enter the DLQ

  1. Your application reads a message from the inbox and attempts to process it
  2. Processing fails (external API timeout, validation error, business rule violation)
  3. You call tide.inbox_mark_failed(inbox_name, event_id, error_message) (see the consumer sketch after this list)
  4. The message's retry_count is incremented and last_error is recorded
  5. After max_retries failures, the message is effectively dead-lettered
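
Put together, a consumer loop looks roughly like the PL/pgSQL sketch below. The inbox_mark_failed call is from step 3 above; the inbox_mark_processed signature and the my_app.process_event function are assumptions for illustration:

DO $$
DECLARE
  msg record;
BEGIN
  -- Claim the oldest unprocessed message (columns as shown in the DLQ queries below)
  SELECT event_id, payload INTO msg
  FROM tide."my-inbox_inbox"
  WHERE processed_at IS NULL
  ORDER BY received_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED;

  IF FOUND THEN
    BEGIN
      PERFORM my_app.process_event(msg.payload);                    -- hypothetical handler
      PERFORM tide.inbox_mark_processed('my-inbox', msg.event_id);  -- assumed signature
    EXCEPTION WHEN OTHERS THEN
      PERFORM tide.inbox_mark_failed('my-inbox', msg.event_id, SQLERRM);
    END;
  END IF;
END $$;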

Querying the DLQ

-- Find all dead-lettered messages in an inbox
SELECT event_id, payload, retry_count, last_error, received_at
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL
  AND retry_count >= 5  -- assuming max_retries = 5
ORDER BY received_at;

Investigating failures

-- Group DLQ messages by error pattern
SELECT
  left(last_error, 50) AS error_pattern,
  count(*) AS message_count,
  min(received_at) AS earliest,
  max(received_at) AS latest
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL AND retry_count >= 5
GROUP BY left(last_error, 50)
ORDER BY message_count DESC;

Replaying messages

After fixing the underlying issue, replay specific messages or all DLQ messages:

-- Replay specific messages
SELECT tide.replay_inbox_messages('my-inbox',
  ARRAY['evt-001', 'evt-002', 'evt-003']);

-- Replay all DLQ messages for an inbox
SELECT tide.replay_inbox_messages('my-inbox',
  (SELECT array_agg(event_id)
   FROM tide."my-inbox_inbox"
   WHERE processed_at IS NULL AND retry_count >= 5)
);

Replaying resets retry_count to 0, making messages eligible for normal processing again.
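
You can verify the reset immediately afterwards; replayed messages should show retry_count = 0 while remaining unprocessed:

-- Replayed messages become eligible again: retry_count = 0, still unprocessed
SELECT count(*)
FROM tide."my-inbox_inbox"
WHERE processed_at IS NULL AND retry_count = 0;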


Extension Error Reference

Errors raised by pg_tide SQL functions:

  • outbox already exists: {name} (raised by outbox_create): An outbox with this name already exists. Pass p_if_not_exists := true to suppress the error.
  • outbox not found: {name} (raised by outbox_publish, outbox_drop, outbox_status, outbox_enable/disable): No outbox with this name exists. Create it first with outbox_create.
  • inbox already exists: {name} (raised by inbox_create): An inbox with this name already exists.
  • inbox not found: {name} (raised by inbox_drop, inbox_mark_processed/failed, inbox_status): No inbox with this name exists.
  • relay pipeline not found: {name} (raised by relay_enable/disable/delete/get_config): No pipeline with this name exists in the catalog.
  • invalid argument: {details} (raised by various functions): A parameter value is invalid (e.g., negative retention_hours).
  • SPI error: {details} (raised by various functions): An internal database error occurred during SPI execution.

Handling extension errors in PL/pgSQL

DO $$
BEGIN
  PERFORM tide.outbox_publish('maybe-missing', '{}'::jsonb, '{}'::jsonb);
EXCEPTION
  WHEN OTHERS THEN
    RAISE NOTICE 'Publish failed: %', SQLERRM;
    -- Handle gracefully: log, retry, or use a fallback
END $$;

Relay Error Reference

Errors logged by the pg-tide relay binary:

  • postgres error (Connection): Database communication failure. Self-healing; the relay reconnects.
  • postgres connection failed (Connection): Cannot reach PostgreSQL. Self-healing; the relay retries.
  • config error (Configuration): Invalid TOML or a missing field. Not self-healing; fix the config file.
  • invalid config for pipeline (Configuration): Pipeline JSONB failed validation. Not self-healing; fix the SQL config.
  • pipeline not found (Configuration): The referenced pipeline doesn't exist. Not self-healing; create the pipeline.
  • missing required config key (Configuration): A required backend config key is missing. Not self-healing; fix the SQL config.
  • unsupported outbox payload version (Payload): Message format version mismatch. Not self-healing; upgrade the relay or fix the publisher.
  • payload decode error (Payload): Cannot deserialize the message. Not self-healing; fix the publisher.
  • sink publish error (Delivery): The sink rejected the message or timed out. Self-healing; the relay retries.
  • sink unhealthy (Delivery): The sink is not accepting connections. Self-healing; the relay retries.
  • source poll error (Ingestion): Source read failure. Self-healing; the relay retries.
  • channel closed (Internal): An internal communication channel dropped. Self-healing; the relay recovers.

Monitoring Errors

Prometheus metrics for error tracking

# Delivery error rate by pipeline (should be 0 in steady state)
rate(pg_tide_relay_publish_errors_total[5m])

# Unhealthy pipelines (immediate alert)
pg_tide_relay_pipeline_healthy == 0

# Error rate as a percentage of total deliveries
rate(pg_tide_relay_publish_errors_total[5m])
  / rate(pg_tide_relay_messages_published_total[5m])

Alerting rules

- alert: PgTideDeliveryErrors
  expr: rate(pg_tide_relay_publish_errors_total[5m]) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Delivery errors on pipeline {{ $labels.pipeline }}"
    description: "The relay is experiencing delivery failures. Check sink availability."

- alert: PgTidePipelineDown
  expr: pg_tide_relay_pipeline_healthy == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} is unhealthy"
    description: "Immediate investigation required. Messages are accumulating."

Log-based monitoring

With structured JSON logging (--log-format json), you can filter and alert on error logs:

{"level":"error","pipeline":"orders-to-kafka","error":"sink publish error: BrokerNotAvailable","msg":"delivery failed, will retry","timestamp":"2025-01-15T10:30:00Z"}

Key fields to monitor:

  • level=error — any error-level log indicates a problem
  • pipeline — identifies which pipeline is affected
  • error — the specific error message for diagnosis