BigQuery

Google BigQuery is a serverless, highly scalable data warehouse that can analyze petabytes of data with standard SQL. Unlike traditional warehouses that require capacity provisioning, BigQuery separates storage from compute and charges only for the queries you run and the data you store. When pg_tide delivers messages to BigQuery, your PostgreSQL events become immediately available for analytics, machine learning (via BigQuery ML), and visualization in Looker or Looker Studio.

When to Use This Sink

Choose BigQuery when your analytics stack runs on Google Cloud, when you want truly serverless analytics with no capacity planning, or when you need to combine PostgreSQL event data with other GCP datasets. BigQuery's streaming insert API makes events queryable within seconds of delivery, and its columnar storage format provides excellent compression and query performance for analytical workloads.
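
As a sketch of that cross-dataset capability, the query below joins relayed events against another dataset in the same project. The events table matches the schema shown later in this chapter; the billing table and its columns are purely illustrative placeholders, not part of pg_tide or GCP:

-- Hypothetical example: combine relayed PostgreSQL events with another
-- GCP dataset. `billing.gcp_billing_export` and its columns are placeholders.
SELECT
  DATE(e.published_at) AS day,
  COUNT(*) AS event_count,
  SUM(b.cost) AS daily_cost
FROM event_analytics.raw_events AS e
JOIN billing.gcp_billing_export AS b
  ON DATE(e.published_at) = DATE(b.usage_start_time)
WHERE e.outbox_name = 'events'
GROUP BY 1
ORDER BY 1;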

Configuration

SELECT tide.relay_set_outbox(
    'events-to-bigquery',
    'events',
    'bq-relay',
    '{
        "sink_type": "bigquery",
        "project_id": "${env:GCP_PROJECT_ID}",
        "dataset": "event_analytics",
        "table": "raw_events",
        "credentials_json": "${file:/etc/gcp/service-account.json}",
        "batch_size": 500
    }'::jsonb
);

Configuration Reference

Parameter         Type     Default       Description
sink_type         string                 Must be "bigquery"
project_id        string                 GCP project ID
dataset           string                 BigQuery dataset name
table             string                 BigQuery table name
credentials_json  string   null          Service account JSON (falls back to ADC)
batch_size        int      500           Rows per streaming insert batch (max 10,000)
insert_method     string   "streaming"   Insert method: "streaming" or "load"

Insert Methods

  • Streaming inserts (default): Events are queryable within seconds. Best for real-time analytics. Incurs streaming insert pricing.
  • Load jobs: Events are batched into files and loaded periodically. Lower cost but higher latency (minutes). Best for cost-sensitive batch analytics.
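
Switching methods should, under the configuration schema above, be a matter of setting insert_method. A sketch, reusing the earlier example with load jobs selected (all other values unchanged):

SELECT tide.relay_set_outbox(
    'events-to-bigquery',
    'events',
    'bq-relay',
    '{
        "sink_type": "bigquery",
        "project_id": "${env:GCP_PROJECT_ID}",
        "dataset": "event_analytics",
        "table": "raw_events",
        "credentials_json": "${file:/etc/gcp/service-account.json}",
        "insert_method": "load"
    }'::jsonb
);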

Table Schema

CREATE TABLE event_analytics.raw_events (
    event_id STRING,
    outbox_name STRING,
    payload JSON,
    dedup_key STRING,
    operation STRING,
    published_at TIMESTAMP
)
PARTITION BY DATE(published_at)
CLUSTER BY outbox_name, operation;

Partitioning by date and clustering by common filter columns optimizes both cost and query performance.
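
For example, a query that filters on the partition column and a clustered column only scans the matching partition, and clustering narrows the bytes read further (table and column names follow the schema above):

-- Scans only today's partition; the outbox_name filter benefits
-- from clustering and reduces bytes billed.
SELECT operation, COUNT(*) AS n
FROM event_analytics.raw_events
WHERE DATE(published_at) = CURRENT_DATE()
  AND outbox_name = 'events'
GROUP BY operation;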

Troubleshooting

  • "Access Denied" — Service account needs roles/bigquery.dataEditor on the dataset
  • "Table not found" — Create the table and dataset before starting the pipeline
  • "Streaming insert quota exceeded" — BigQuery has per-table streaming limits; use load jobs for very high volumes
  • "Invalid rows" — Schema mismatch between event structure and table schema
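
For the "Access Denied" case, one way to grant the role at the dataset level is BigQuery's GRANT DDL; the project name and service-account address below are placeholders to replace with your own:

-- Grant dataEditor on the dataset (schema) to the relay's service account.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.event_analytics`
TO "serviceAccount:bq-relay@my-project.iam.gserviceaccount.com";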

Further Reading