Object Storage (S3 / GCS / Azure Blob)

Object storage services like Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage provide virtually unlimited, highly durable storage at very low cost. When pg_tide delivers messages to object storage, your events are written as files (in JSONL or Parquet format) organized in a path structure you define. This creates a data lake landing zone that tools like Athena, BigQuery, Trino, DuckDB, or Spark can query directly, without dedicated streaming infrastructure.

When to Use This Sink

Choose object storage when you need cost-effective long-term archival of events, when you want to build a data lake without committing to a specific table format (Iceberg/Delta), when compliance requires immutable event retention, or when your analytical tools can query files directly from object storage. For high-volume event archival, it is typically the cheapest option available.

Configuration

SELECT tide.relay_set_outbox(
    'events-to-s3',
    'events',
    's3-relay',
    '{
        "sink_type": "object_storage",
        "provider": "s3",
        "bucket": "my-data-lake",
        "prefix": "events/{stream_table}/year={year}/month={month}/day={day}/",
        "format": "parquet",
        "region": "us-east-1",
        "batch_size": 1000,
        "file_rotation_seconds": 300
    }'::jsonb
);
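
The same call shape works for the other providers. As a sketch, a GCS configuration might look like the following; the relay and bucket names here are hypothetical, and it assumes the "gcs" provider accepts the same keys as "s3" and falls back to the environment's default credential chain when no explicit keys are set (see the reference below):

SELECT tide.relay_set_outbox(
    'events-to-gcs',
    'events',
    'gcs-relay',
    '{
        "sink_type": "object_storage",
        "provider": "gcs",
        "bucket": "my-gcs-lake",
        "prefix": "events/{stream_table}/year={year}/month={month}/day={day}/",
        "format": "parquet"
    }'::jsonb
);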

Configuration Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| sink_type | string | — | Must be "object_storage" |
| provider | string | — | Storage provider: "s3", "gcs", or "azure" |
| bucket | string | — | Bucket/container name |
| prefix | string | "" | Path prefix template. Supports {stream_table}, {year}, {month}, {day}, {hour} |
| format | string | "jsonl" | File format: "jsonl" or "parquet" |
| region | string | null | Cloud region |
| access_key_id | string | null | Access key (falls back to the default credential chain) |
| secret_access_key | string | null | Secret key |
| batch_size | int | 1000 | Records per file |
| file_rotation_seconds | int | 300 | Maximum seconds before a file is finalized and uploaded |
| compression | string | null | Compression for JSONL: "gzip" or "zstd" |
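
As one more illustration of the parameters above, here is a sketch of a JSONL sink with zstd compression, hourly partitions, and faster file rotation; all names are hypothetical:

SELECT tide.relay_set_outbox(
    'events-to-s3-jsonl',
    'events',
    's3-relay-jsonl',
    '{
        "sink_type": "object_storage",
        "provider": "s3",
        "bucket": "my-data-lake",
        "prefix": "events/{stream_table}/year={year}/month={month}/day={day}/hour={hour}/",
        "format": "jsonl",
        "compression": "zstd",
        "batch_size": 5000,
        "file_rotation_seconds": 60
    }'::jsonb
);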

File Formats

JSONL (JSON Lines)

One JSON object per line. Human-readable, easy to process with standard tools:

{"event_id":"abc","payload":{"order_id":"ord-1"},"op":"insert","ts":"2024-01-15T10:30:00Z"}
{"event_id":"def","payload":{"order_id":"ord-2"},"op":"insert","ts":"2024-01-15T10:30:01Z"}

Parquet

Columnar binary format optimized for analytics. Parquet files are typically 10-50x smaller than raw JSON and dramatically faster to query in analytical workloads. Use Parquet when the data will be queried by analytics engines.

Path Partitioning

The prefix template creates a Hive-style partitioned layout that analytical engines recognize automatically:

s3://my-lake/events/orders/year=2024/month=01/day=15/batch-001.parquet
s3://my-lake/events/orders/year=2024/month=01/day=15/batch-002.parquet

This enables partition pruning — queries that filter by date only read the relevant files, dramatically reducing scan costs.
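
As an illustration, DuckDB's Hive partitioning support applies date filters to the path segments before opening any file, so only the matching day's objects are scanned. A sketch reusing the layout above (httpfs and S3 credentials are assumed to be configured):

-- year, month, and day become columns derived from the path, not the file contents.
SELECT count(*) AS events
FROM read_parquet(
    's3://my-lake/events/orders/*/*/*/*.parquet',
    hive_partitioning = true
)
WHERE year = '2024' AND month = '01' AND day = '15';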

Troubleshooting

  • "Access Denied" — Check IAM permissions: s3:PutObject on the bucket/prefix
  • "Bucket not found" — Verify bucket name and region
  • Large files / memory pressure — Reduce batch_size or file_rotation_seconds
  • Query engine can't read files — Ensure format matches what the engine expects; check compression codec
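
For the "Access Denied" case, a minimal IAM policy for the S3 configuration example might look like the following. This is a sketch: the bucket name matches the example above, and the inclusion of s3:AbortMultipartUpload is an assumption for deployments that upload large files in parts:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRelayWrites",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:AbortMultipartUpload"],
      "Resource": "arn:aws:s3:::my-data-lake/events/*"
    }
  ]
}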

Further Reading

  • Apache Iceberg — Add ACID transactions and time travel on top of object storage
  • Delta Lake — Alternative table format for object storage
  • Snowflake — Query object storage files from Snowflake external tables