Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Apache Arrow Flight

Apache Arrow Flight is a high-performance RPC framework for transferring large datasets between systems using the Apache Arrow columnar memory format. Unlike JSON-based protocols that require serialization/deserialization, Arrow Flight transfers data in-memory columnar format over gRPC, achieving throughput measured in gigabytes per second. When pg_tide delivers messages via Arrow Flight, your events are batched into Arrow record batches and streamed to any Arrow Flight-compatible endpoint.

When to Use This Sink

Choose Arrow Flight when you need maximum throughput for analytical workloads (machine learning pipelines, real-time feature stores, analytics engines), when the receiving system supports Arrow natively (DuckDB, DataFusion, Polars, pandas, many ML frameworks), or when you want to minimize serialization overhead for high-volume data transfer. Arrow Flight is particularly effective for scenarios where events are consumed in batches for computation rather than processed individually.

Configuration

SELECT tide.relay_set_outbox(
    'events-to-flight',
    'events',
    'flight-relay',
    '{
        "sink_type": "arrow_flight",
        "endpoint": "grpc://${env:FLIGHT_HOST}:8815",
        "batch_size": 5000,
        "tls_enabled": false
    }'::jsonb
);

Configuration Reference

ParameterTypeDefaultDescription
sink_typestringMust be "arrow_flight"
endpointstringgRPC endpoint URL
batch_sizeint1000Records per Arrow record batch
tls_enabledboolfalseEnable TLS for gRPC
auth_tokenstringnullBearer token for authentication

How It Works

Messages are accumulated into batches and converted to Arrow columnar format (record batches). The relay then streams these record batches to the Flight endpoint using gRPC's DoPut RPC. This approach is dramatically more efficient than JSON-over-HTTP for large batch transfers because:

  1. Arrow's columnar format enables zero-copy reads on the receiver side
  2. gRPC streaming amortizes connection overhead across many records
  3. Arrow's type system preserves data types without string conversion

Troubleshooting

  • "Connection failed" — Verify the gRPC endpoint is reachable and the port is correct
  • "Unauthenticated" — Set auth_token if the Flight server requires authentication
  • Low throughput — Increase batch_size; Arrow Flight is most efficient with large batches

Further Reading