
# Capacity Planning

This guide helps you estimate the resources needed for pg_tide based on your workload characteristics.

## Key Dimensions

Capacity planning for pg_tide involves three systems: PostgreSQL (where outbox/inbox tables live), the relay process (CPU and memory), and the network (bandwidth to sinks).

### PostgreSQL

The outbox table is the primary bottleneck for most deployments. Key factors:

| Factor | Impact | Mitigation |
|---|---|---|
| Write throughput | INSERT rate into outbox tables | Connection pooling, partitioning |
| Table size | Unrelayed rows waiting for delivery | Tune batch size and poll interval |
| Index maintenance | Outbox has a sequential ID index | Minimal (append-only workload) |
| Disk I/O | WAL writes for each INSERT | Fast storage, WAL tuning |

**Rule of thumb:** A single PostgreSQL instance handles 10,000-50,000 outbox inserts/second, depending on row size and hardware. The relay processes rows faster than most applications can generate them.

### Relay Process

The relay is light on both CPU and memory for most workloads:

| Workload | CPU | Memory | Bottleneck |
|---|---|---|---|
| 1,000 msg/s, JSON, Kafka | 0.1 core | 50 MB | Network to Kafka |
| 10,000 msg/s, JSON, Kafka | 0.5 core | 100 MB | Kafka ack latency |
| 10,000 msg/s, Avro, Schema Registry | 1 core | 200 MB | Avro serialization |
| 50,000 msg/s, JSON, NATS | 0.3 core | 80 MB | Network |
| 1,000 msg/s, HTTP webhook | 0.1 core | 50 MB | Webhook response time |

**Rule of thumb:** Start with 0.5 CPU and 256 MB of memory. Monitor actual usage and adjust.

### Network

Bandwidth depends on message size and throughput:

```text
bandwidth = messages_per_second × average_message_size_bytes
```

**Example:** 10,000 msg/s × 1 KB/msg = 10 MB/s = 80 Mbps
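
As a quick sanity check, the same calculation in Python (a standalone sketch; the helper name is ours, not part of pg_tide):

```python
def required_bandwidth_mbps(messages_per_second: float, avg_message_bytes: float) -> float:
    """Sink-side bandwidth in megabits per second (decimal units)."""
    bytes_per_second = messages_per_second * avg_message_bytes
    return bytes_per_second * 8 / 1_000_000

# 10,000 msg/s at 1 KB each
print(required_bandwidth_mbps(10_000, 1_000))  # 80.0
```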

## Sizing Formulas

### Outbox Table Growth

If the relay is down (or relaying more slowly than the application inserts), the outbox grows:

```text
rows_pending = (insert_rate - relay_rate) × downtime_seconds
disk_usage   = rows_pending × average_row_size
```

**Example:** 5,000 inserts/s with the relay fully down for 5 minutes, at 500 bytes per row:

- Rows: 5,000 × 300 = 1,500,000
- Disk: 1,500,000 × 500 bytes = 750 MB
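
The same arithmetic as a small Python helper (a planning sketch, not part of pg_tide):

```python
def outbox_backlog(insert_rate, relay_rate, downtime_seconds, avg_row_bytes):
    """Rows and bytes that pile up while the relay lags behind inserts."""
    pending_rows = max(insert_rate - relay_rate, 0) * downtime_seconds
    return pending_rows, pending_rows * avg_row_bytes

# Relay fully down (relay_rate = 0) for 5 minutes at 5,000 inserts/s, 500-byte rows:
rows, disk_bytes = outbox_backlog(5_000, 0, 300, 500)
print(rows, disk_bytes / 1_000_000)  # 1500000 750.0 (MB)
```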

### Consumer Lag Recovery Time

After an outage, estimate how long it takes to drain the backlog:

```text
recovery_time = pending_rows / (relay_rate - insert_rate)
```

**Example:** 1.5M pending rows, relay draining at 20,000/s, inserts continuing at 5,000/s:

- Recovery: 1,500,000 / 15,000 = 100 seconds
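
In Python, with a guard for the formula's precondition that the relay must outpace inserts (otherwise the backlog never drains):

```python
def recovery_seconds(pending_rows, relay_rate, insert_rate):
    """Seconds to drain the backlog; requires the relay to outpace inserts."""
    drain_rate = relay_rate - insert_rate
    if drain_rate <= 0:
        raise ValueError("relay_rate must exceed insert_rate, or the backlog never drains")
    return pending_rows / drain_rate

print(recovery_seconds(1_500_000, 20_000, 5_000))  # 100.0
```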

### Relay Instance Count

For active-active HA with balanced load:

```text
instances = ceil(total_pipelines / pipelines_per_instance)
```

Most pipelines consume negligible resources. Start with 2 instances (for HA) and scale based on actual throughput needs.
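
A small helper implementing this, with the two-instance HA floor recommended above (the function is ours, for illustration):

```python
import math

def relay_instances(total_pipelines, pipelines_per_instance, min_instances=2):
    """Instance count from the formula above, floored at two for HA."""
    return max(math.ceil(total_pipelines / pipelines_per_instance), min_instances)

print(relay_instances(25, 10))  # 3
print(relay_instances(5, 10))   # 2 (HA floor)
```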

## Batch Size Tuning

Batch size affects both throughput and latency:

| Batch Size | Throughput | Latency | Use Case |
|---|---|---|---|
| 1 | Lowest | Lowest | Real-time notifications |
| 10-50 | Medium | Low | General event streaming |
| 100-500 | High | Medium | Analytics, data lake loading |
| 1000+ | Highest | Higher | Bulk ETL, backfill |

Configure per-pipeline:

{ "batch_size": 100 }

Or set a process-wide default:

```sh
pg-tide --default-batch-size 100
```
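
To see why larger batches raise throughput, note that each batch pays roughly one poll wait plus one sink round trip. The following is a rough back-of-envelope model, not measured pg_tide behavior; both timing figures are illustrative:

```python
def max_throughput(batch_size, poll_interval_s, sink_roundtrip_s):
    """Rough upper bound on msg/s for one pipeline if every poll returns a full batch."""
    return batch_size / (poll_interval_s + sink_roundtrip_s)

# Illustrative figures only: 100 ms poll interval, 20 ms sink round trip
print(max_throughput(100, 0.100, 0.020))    # ~833 msg/s
print(max_throughput(1_000, 0.100, 0.020))  # ~8,333 msg/s
```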

## PostgreSQL Configuration

Key PostgreSQL settings for outbox-heavy workloads:

```ini
# postgresql.conf

# Connection handling
max_connections = 200                    # enough for app + relay + monitoring
shared_buffers = '4GB'                   # ~25% of RAM

# WAL configuration (important for high-insert workloads)
wal_buffers = '64MB'
max_wal_size = '4GB'
checkpoint_completion_target = 0.9

# Vacuuming (outbox rows are deleted after relay)
autovacuum_vacuum_scale_factor = 0.01    # vacuum more aggressively
autovacuum_naptime = '10s'
```
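
Because relayed rows are deleted, dead-tuple buildup in the outbox is the clearest sign these vacuum settings are not keeping up. A minimal check using `psycopg2`; `pg_stat_user_tables` is a standard PostgreSQL view, while the table name and connection string are placeholders:

```python
import psycopg2

OUTBOX_TABLE = "tide_outbox"  # placeholder: substitute your actual outbox table name

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        WHERE relname = %s
        """,
        (OUTBOX_TABLE,),
    )
    row = cur.fetchone()
    if row is None:
        print(f"no statistics for {OUTBOX_TABLE}")
    else:
        live, dead, last_vacuum = row
        print(f"live={live} dead={dead} last_autovacuum={last_vacuum}")
        if dead > max(live, 1) * 0.05:
            print("dead tuples exceed 5% of live rows: autovacuum may be falling behind")
conn.close()
```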

## Monitoring for Capacity

Set alerts on these metrics to detect capacity issues early:

| Metric | Warning Threshold | Action |
|---|---|---|
| `pg_tide_consumer_lag` | > 10,000 | Increase batch size or add relay instances |
| CPU usage (relay) | > 70% sustained | Add CPU or split pipelines |
| PostgreSQL connections | > 80% of max | Increase `max_connections` or use PgBouncer |
| Disk usage growth | > 1 GB/hour of unrelayed rows | Investigate relay health |
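
For the lag alert, here is a minimal polling check. The metric name `pg_tide_consumer_lag` comes from the table above; the assumption that the relay exposes it in Prometheus text format on a `/metrics` endpoint (and the port) is ours, so adapt this to however you actually collect metrics:

```python
from urllib.request import urlopen

# Placeholder endpoint: point this at wherever your relay serves metrics.
METRICS_URL = "http://localhost:9100/metrics"
LAG_WARNING = 10_000  # matches the warning threshold in the table above

body = urlopen(METRICS_URL, timeout=5).read().decode()
for line in body.splitlines():
    # Prometheus text format, e.g.: pg_tide_consumer_lag{pipeline="orders"} 1234
    if line.startswith("pg_tide_consumer_lag"):
        series, value = line.rsplit(" ", 1)
        if float(value) > LAG_WARNING:
            print(f"WARNING: {series} lag {value} exceeds {LAG_WARNING}")
```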

## Further Reading